Re: [chuck-users] Getting Started with ChucK
On 21 Jul 2009, at 20:30, Tom Lieber wrote:
2009/7/21 Hans Aberg
: So it might be better to tweak some existing functional language into accepting Chuck's timing concepts.
Want to help with ruck? :D
http://github.com/alltom/ruck/tree/master
It all works and has nice sugar like chucking arrays of objects and lambda properties, but it's too slow for real-time at reasonable sample rates. I haven't sat down to do the porting to C that I think it needs to perform well, even though I think it shows promise.
I looked at Ruby, and didn't like some aspects of it, though I do not remember what it was :-). But there is a problem with the GC: the Wikipedia page did not specify which one Ruby uses. Pretty much the only GC that can fulfill the requirements of exact timing without jumps and concurrency is reference counting. Experts will assure you that this need not be so, but also admit that it is hard to achieve in practice. So I think Chuck, if extended with lambda calculus, might try adding it on top of what already exists, and then just use reference counting for GC. Hans
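For concreteness, reference counting in its simplest form looks something like the generic C sketch below (not ChucK code). The point is that memory is reclaimed the instant the last reference disappears, so there is never a separate collection pause; the standard caveat is that reference cycles are never reclaimed.

    #include <stdlib.h>

    typedef struct Obj {
        int refcount;
        /* ... payload, e.g. a unit generator's state ... */
    } Obj;

    Obj *obj_new(void) {
        Obj *o = malloc(sizeof *o);
        o->refcount = 1;              /* the creator holds one reference */
        return o;
    }

    void obj_retain(Obj *o) { o->refcount++; }

    void obj_release(Obj *o) {
        if (--o->refcount == 0)       /* last reference dropped:          */
            free(o);                  /* reclaim now, no collection pause */
    }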
On Tue, Jul 21, 2009 at 3:06 PM, Hans Aberg
But there is a problem with the GC: the Wikipedia page did not specify which one Ruby uses. Pretty much the only GC that can fulfill the requirements of exact timing without jumps and concurrency is reference counting.
How slow is garbage collection? No ChucK code executes with exact timing even now. It's on the user to spread out computation to avoid delays, just as it would be on them to mind their allocations to avoid GC delays. Concurrent GC would be nice, but making the VM multi-threaded is not the only way to synthesize sounds faster. I'm more interested in bringing ChucK ideas to other languages than the other way around right now. Other VMs have had a few more eyes and hands on them, and adding timing rules to them has been easier than adding language features to ChucK for me. I'm leaving that to the pros! -- Tom Lieber http://AllTom.com/
On 22 Jul 2009, at 00:06, Tom Lieber wrote:
But there is a problem with the GC: the Wikipedia page did not specify which one Ruby uses. Pretty much the only GC that can fulfill the requirements of exact timing without jumps and concurrency is reference counting.
How slow is garbage collection? No ChucK code executes with exact timing even now. It's on the user to spread out computation to avoid delays, just as it would be on them to mind their allocations to avoid GC delays.
Concurrent GC would be nice, but making the VM multi-threaded is not the only way to synthesize sounds faster.
There is a thread in the Usenet newsgroup comp.compilers about reference counting that discusses some of these things. The moderator got bored with all these GC discussions - so one aspect of it is that one may end up implementing GCs rather than a programming language! :-) The problem is not getting the GC fast enough, but that collectors tend to do their collecting all at once, at a specific time when the whole program sort of halts. This would easily exceed the few tenths of a second that would be acceptable in a live performance with Chuck - even that would be unacceptable. One might try to avoid that by having the GC run in a separate thread, but there are problems with that, too (I don't remember exactly). One interesting thing that came up was that a tracing GC may need to trace swapped-out virtual memory, which when swapped back into memory causes delays - the hard drive is a bottleneck.
I'm more interested in bringing ChucK ideas to other languages than the other way around right now. Other VMs have had a few more eyes and hands on them, and adding timing rules to them has been easier than adding language features to ChucK for me. I'm leaving that to the pros!
The problem is that these other GCs probably are not designed for real-time use, but to speed up programming. One needs something that can be used in real-time systems. Hans
At the risk of sounding like the old codger who says "back when I was a kid, I walked sixteen miles in the snow barefoot to school every day..." ... On 22 Jul 2009, at 01:03, Hans Aberg wrote:
On 22 Jul 2009, at 00:06, Tom Lieber wrote:
<snip>...GC delays.
Concurrent GC would be nice, but making the VM multi-threaded is not the only way to synthesize sounds faster.
<snip> The problem is not getting the GC fast enough, but that collectors tend to do their collecting all at once, at a specific time when the whole program sort of halts.
Once upon a time, I worked at a company called Lucid. We made Common Lisp compilers. Prior to Lucid, most Lisp systems would grind to a halt whenever a GC took place. One of the major contributions made by Patrick Sobalvarro was an ephemeral garbage collector for general-purpose computers [*]: every time you allocated some memory, it did a little bit of reclamation. Moral of the story: ephemeral GC techniques are mature, widely understood and are exactly what's needed in a real-time environment. - Rob [*] Sobalvarro, Patrick G., "A lifetime-based garbage collector for LISP systems on general-purpose computers," B.S. thesis, MIT EECS Dept., 1988.
On 22 Jul 2009, at 17:16, Robert Poor wrote:
<snip> The problem is not getting the GC fast enough, but that collectors tend to do their collecting all at once, at a specific time when the whole program sort of halts.
Once upon a time, I worked at a company called Lucid. We made Common Lisp compilers. Prior to Lucid, most Lisp systems would grind to a halt whenever a GC took place. One of the major contributions made by Patrick Sobalvarro was an ephemeral garbage collector for general-purpose computers [*]: every time you allocated some memory, it did a little bit of reclamation.
Moral of the story: ephemeral GC techniques are mature, widely understood and are exactly what's needed in a real-time environment.
So goes the theory, but the comp.compilers thread mentioned some practical problems. If you want to collect little by little you need some large buffers, and tracing can cause virtual memory swaps on modern systems. But if you know of one that is tested and can guarantee no delays in the program threads, let's hear it. Hans
On Wed, 22 Jul 2009, Hans Aberg wrote:
But if you know of one that is tested and can guarantee no delays in the program threads, let's hear it.
Supercollider uses a real-time incremental garbage collector based on this paper: Paul R. Wilson, Uniprocessor Garbage Collection Techniques, Proceedings of the 1992 International Workshop on Memory Management, Springer-Verlag, Berlin, 1992. ftp://ftp.cs.utexas.edu/pub/garbage/gcsurvey.ps Wilson's method doesn't have to run in a separate thread (so no thread locking), the work done per allocation is strictly bounded (i.e. it's guaranteed that there are no long garbage-collection pauses during allocation), and the overhead per heap access is likewise constant (and tiny.) There have been advances in the state of the art since 1992, but this method works fine for high-throughput systems like Supercollider, and is certainly adequate for low-latency, low-performance systems like ChucK. -- Tom Duff. Electrically engraved with the world's finest entertainment.
On 23 Jul 2009, at 18:54, Tom Duff wrote:
On Wed, 22 Jul 2009, Hans Aberg wrote:
But if you know of one that is tested and can guarantee no delays in the program threads, let's hear it.
Supercollider uses a real-time incremental garbage collector based on this paper:
Paul R. Wilson, Uniprocessor Garbage Collection Techniques, Proceedings of the 1992 International Workshop on Memory Management, Springer-Verlag, Berlin, 1992. ftp://ftp.cs.utexas.edu/pub/garbage/gcsurvey.ps
Wilson's method doesn't have to run in a separate thread (so no thread locking), the work done per allocation is strictly bounded (i.e. it's guaranteed that there are no long garbage-collection pauses during allocation), and the overhead per heap access is likewise constant (and tiny.)
This is a survey article. Which method is used?
There have been advances in the state of the art since 1992, but this method works fine for high-throughput systems like Supercollider, and is certainly adequate for low-latency, low-performance systems like ChucK.
The thread in comp.compilers on reference counting mentioned that the Java collector, when its buffer grew beyond 50 MB or so, could choke and hang for tens of seconds, even on the latest Macs. Even if the GC works fine on a machine without virtual memory, it could choke in a virtual memory environment, if tracing takes place on swapped-out memory. The whole system can choke. The before-mentioned thread also gave some references, but none that actually solve the problem. And Chuck is in fact resource-demanding because of its exact timing model. Simple IO in fact causes delays, enough to make real-time melodic playing difficult (though I tried it on a not so fast computer). Hans
On Thu, 23 Jul 2009, Hans Aberg wrote:
On 23 Jul 2009, at 18:54, Tom Duff wrote:
On Wed, 22 Jul 2009, Hans Aberg wrote:
But if you know of one that is tested and can guarantee no delays in the program threads, let's hear it.
Supercollider uses a real-time incremental garbage collector based on this paper:
Paul R. Wilson, Uniprocessor Garbage Collection Techniques, Proceedings of the 1992 International Workshop on Memory Management, Springer-Verlag, Berlin, 1992. ftp://ftp.cs.utexas.edu/pub/garbage/gcsurvey.ps
Wilson's method doesn't have to run in a separate thread (so no thread locking), the work done per allocation is strictly bounded (i.e. it's guaranteed that there are no long garbage-collection pauses during allocation), and the overhead per heap access is likewise constant (and tiny.)
This is a survey article. Which method is used?
Sorry, missed by one: the correct reference is one paper north on Wilson's page: http://www.cs.utexas.edu/~oops/papers.html Paul R. Wilson and Mark S. Johnstone. Real-Time Non-Copying Garbage Collection. Position paper for the 1993 ACM OOPSLA Workshop on Memory Management and Garbage Collection, Washington D.C. September 1993. ftp://ftp.cs.utexas.edu/pub/garbage/GC93/wilson.ps
There have been advances in the state of the art since 1992, but this method works fine for high-throughput systems like Supercollider, and is certainly adequate for low-latency, low-performance systems like ChucK.
The thread in comp.compilers on reference counting mentioned that the Java collector, when its buffer grew beyond 50 MB or so, could choke and hang for tens of seconds, even on the latest Macs.
I'm not sure why this is relevant or responsive. All this demonstrates is that if you use inappropriate methods, you may very well get inappropriate results.
Even if the GC works fine on a machine without virtual memory, it could choke in a virtual memory environment, if tracing takes place on swapped-out memory. The whole system can choke. The before-mentioned thread also gave some references, but none that actually solve the problem.
Again, I'm not sure why this is relevant. Virtual memory is a relic of a bygone era. I don't recall having a paging-related performance problem in music software in the last 15 years. Even my laptop has 8GB of primary storage. Nothing ever gets paged out.
And Chuck is in fact resource-demanding because of its exact timing model. Simple IO in fact causes delays, enough to make real-time melodic playing difficult (though I tried it on a not so fast computer).
What does this have to do with garbage collection? The last time I looked, a couple of months ago, Chuck's garbage collector was a stub, and even the stub wasn't called from anywhere. It doesn't have one! Have you used Supercollider? Does it have the delay problems you're worried about? Are they attributable to the garbage collector? I don't know what you mean by "Simple IO" and in fact I don't believe that delay inhibits real time melodic playing. Tracker action pipe organs typically have key-to-sound delays of 250 milliseconds or more. For centuries they have been the primary instruments of many of the foremost improvisers in western music. Unpredictable timing (delay jitter) is undoubtedly a musical problem, but Wilson's incremental real-time garbage collector, as used in Supercollider, is designed specifically to eliminate that. -- Tom Duff. The telephone was the death of Montparnasse.
On 23 Jul 2009, at 21:10, Tom Duff wrote:
The thread in comp.compilers on reference counting mentioned that the Java collector, when its buffer grew beyond 50 MB or so, could choke and hang for tens of seconds, even on the latest Macs.
I'm not sure why this is relevant or responsive. All this demonstrates is that if you use inappropriate methods, you may very well get inappropriate results.
It means that it is hard to use appropriate methods on modern systems.
Even if the GC works fine on a machine without virtual memory, it could choke in a virtual memory environment, if tracing takes place on swapped-out memory. The whole system can choke. The before-mentioned thread also gave some references, but none that actually solve the problem.
Again, I'm not sure why this is relevant. Virtual memory is a relic of a bygone era. I don't recall having a paging-related performance problem in music software in the last 15 years. Even my laptop has 8GB of primary storage. Nothing ever gets paged out.
What do you mean by this? All computers use virtual memory these days. They page in and out all the time, and you do not have control over it. You can decrease the effect by putting in more RAM, but that is no guarantee.
And Chuck is in fact resource-demanding because of its exact timing model. Simple IO in fact causes delays, enough to make real-time melodic playing difficult (though I tried it on a not so fast computer).
What does this have to do with garbage collection?
It is hard even to reserve small time-gaps for stdout.
The last time I looked, a couple of months ago, Chuck's garbage collector was a stub, and even the stub wasn't called from anywhere. It doesn't have one!
The context was to find one suitable for Chuck.
Have you used Supercollider? Does it have the delay problems you're worried about? Are they attributable to the garbage collector?
You tell me. In Chuck, if I play a fast rachenitsa on the keyboard, I notice that the ornaments become delayed if there is some IO to the console going on. Does that happen in SuperCollider? If you have some really big programs in SC, the GC must be used a lot - does that affect the timing? How do you check that the GC is in use in SC? The thing is that a GC can be made arbitrarily effective by increasing its buffer. If it is infinite, just allocate memory and never deallocate. Memory allocation can then be done in perhaps just one or a few cycles; this is the case for a two-space copying GC. But when GC time comes, it becomes slow. If you increase the buffer, it takes longer before GC happens, but there is a bigger problem ahead. Your comments suggest that you just increase RAM to avoid GC time and virtual memory paging. But that is no solution to the GC problem - just a way of avoiding it.
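The "one or a few cycles" allocation of a two-space copying collector is essentially a bump pointer. A generic C sketch (not any particular collector; the copying itself is left out) shows where the cost goes:

    #include <stddef.h>

    static char *from_space;      /* the space currently allocated from */
    static char *to_space;        /* live objects get copied here at GC */
    static char *alloc_ptr;       /* next free byte in from_space       */
    static char *alloc_limit;     /* end of from_space                  */

    static void collect(void)
    {
        /* a real collector would copy the live objects into to_space,
           swap the two spaces and reset alloc_ptr; this is the
           expensive, stop-the-world part, omitted here */
    }

    void *gc_alloc(size_t n)
    {
        n = (n + 7) & ~(size_t)7;             /* align to 8 bytes           */
        if (alloc_ptr + n > alloc_limit)      /* out of space?              */
            collect();                        /* ...then pay the full price */
        void *p = alloc_ptr;
        alloc_ptr += n;                       /* the allocation itself is   */
        return p;                             /* just a compare and an add  */
    }

The point is only that the cost has moved, not disappeared: allocation is a pointer bump, and the entire price is paid inside collect().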
I don't know what you mean by "Simple IO" and in fact I don't believe that delay inhibits real time melodic playing. Tracker action pipe organs typically have key-to-sound delays of 250 milliseconds or more. For centuries they have been the primary instruments of many of the foremost improvisers in western music.
Unpredictable timing (delay jitter) is undoubtedly a musical problem, but Wilson's incremental real-time garbage collector, as used in Supercollider, is designed specifically to eliminate that.
The problem is that it is unpredictable, mostly noticeable in rapid movements. In Chuck, it happens even if there is no CPU overload. It seems tied to its single-threaded implementation. Hans
On 23 Jul 2009, at 21:10, Tom Duff wrote:
This is a survey article. Which method is used?
Sorry, missed by one: the correct reference is one paper north on Wilson's page: http://www.cs.utexas.edu/~oops/papers.html
Paul R. Wilson and Mark S. Johnstone. Real-Time Non-Copying Garbage Collection. Position paper for the 1993 ACM OOPSLA Workshop on Memory Management and Garbage Collection, Washington D.C. September 1993.
ftp://ftp.cs.utexas.edu/pub/garbage/GC93/wilson.ps
It is hard to get an idea of this without somebody testing it thoroughly. But a quick net search says that a non-copying GC gets problems with memory fragmentation, so there are research papers on how to handle that. This is also a problem that can be handled by adding more RAM, but then one is avoiding the GC behaviour that causes the problem, not solving the GC problem. Also, it is a uniprocessor GC (or so I thought I saw). There is a comment here about OCaml having that restriction (last sentences in the section): http://en.wikipedia.org/wiki/OCaml#Features Hans
Tom Duff; The last time I
looked, a couple of months ago, Chuck's garbage collector was a stub, and even the stub wasn't called from anywhere. It doesn't have one!
To place one note in the margin of your post (that I otherwise agree with): we do seem to have some garbage collection, or at least we have some reference counting and collection. A while ago Mike had his big fight with the type system, and according to the latest theories - if I remember correctly - under some conditions, when passing references to objects (especially home-made classes in arrays), there would be one reference too many removed, which would lead to extremely hard-to-trace crashes. As far as I know, at that point we had a bug in a GC we don't fully have yet. Aside from that I agree with what you said (at least the things I already know about). I also would like to remark that Ge & co have been thinking about GC for a long time now. I'm certain they will already be aware of any and all papers on GC in realtime systems, as the problem is well known and recognised. Evidently it's solvable, as SC seems to be doing fine as well. Oh, and I'd like to also mention that I *have* made ChucK use virtual memory. I found that when you get into that kind of situation, in my experience, you won't be running in realtime but will be making coffee (or dinner!) while your laptop's fan gets a workout ;-). Yours, Kas.
On 24 Jul 2009, at 00:09, Kassen wrote:
Oh, and I'd like to also mention that I *have* made ChucK use virtual memory. I found that when you get into that kind of situation, in my experience, you won't be running in realtime but will be making coffee (or dinner!) while your laptop's fan gets a workout ;-).
What might this mean? On UNIX computers, each process gets a couple of address spaces of its own, like the function stack, heap, and program static data; on a 32-bit machine they are typically 2 GB each and handled by virtual memory, built into the OS. One can check usage with commands like 'systat vmstat' or 'vm_stat'. If there is too much paging, there is too little RAM available relative to the programs in use. Hans
Hans; What might this mean? By this I mean that I once tried to design my own reverb for moving sound sources according to a particularly inefficient design. This had to render to a pair of wave files in non-real time, and at some point it started swapping. It also caused other issues and I had to abandon the plan for then; later this turned out to be related to an array bug that has since been fixed (the design was still bad, though it was an interesting experiment). Though they are of course not strictly speaking related, I do think that very high memory usage (and by the time you exhaust a modern computer you are using quite a lot) is quite likely correlated with high CPU usage. If you need a Gig or so for your objects, your samples and your buffers in ChucK, you are probably also running into CPU issues at the same time.
On UNIX computers, each process gets a couple of address spaces of its own, like the function stack, heap, and program static data; on a 32-bit machine they are typically 2 GB each and handled by virtual memory, built into the OS. One can check usage with commands like 'systat vmstat' or 'vm_stat'. If there is too much paging, there is too little RAM available relative to the programs in use.
This particular experiment was done on my stripped version of Windows XP, but I don't think that affects matters much. A modern computer will most likely have at least half a Gig of RAM available for the user to spend at will on things like ChucK before any noticeable swapping will be done. I'm simply not very worried about that right now, because I don't see how we'd use all of that up in a way that *might* cause issues with some forms of GC that we may or may not use in the future. If you have a longer composition that uses a lot of previously recorded instruments that you'd like to mix to a single file using ChucK, you may need an amount of memory that starts to approach the order of magnitude that causes swapping. That's the only case I can think of right now, but a) likely that's not the sort of case that will trigger lots of GC on that data, and b) I think we are already re-claiming space for samples that are no longer used, and so far nobody has complained that this caused huge breakdowns in the realtime paradigm. In fact I can't remember a single complaint about this at all. Loading huge samples from HD while using a small latency is likely a much larger issue. Yours, Kas.
On 24 Jul 2009, at 01:50, Kassen wrote:
If you have a longer composition that uses a lot of previously recorded instruments that you'd like to mix to a single file using ChucK, you may need an amount of memory that starts to approach the order of magnitude that causes swapping. That's the only case I can think of right now, but a) likely that's not the sort of case that will trigger lots of GC on that data, and b) I think we are already re-claiming space for samples that are no longer used, and so far nobody has complained that this caused huge breakdowns in the realtime paradigm. In fact I can't remember a single complaint about this at all. Loading huge samples from HD while using a small latency is likely a much larger issue.
The GC question might be irrelevant, because music does not require so much power relative to graphics, and standard malloc/free is perhaps just a hundred times slower, or even less, than a fast GC. By Moore's law (which I checked on Macs), the chip size doubles every second year, and the CPU frequency about every third - so over six years that gives three doublings of chip size and two of frequency, five doublings combined (using multicore). On the other hand, implementing an advanced GC takes up a lot of programming time. A slow HD might be countered by using an SSD instead. They are quite expensive right now, but prices will surely drop. HDs are buffered, and there are hybrid HD-SSDs (by Samsung). Hans
Hans; The GC question might be irrelevant, because music does not require so much
power relative to graphics, and standard malloc/free is perhaps just a hundred times slower, or even less, than a fast GC. By Moore's law (which I checked on Macs), the chip size doubles every second year, and the CPU frequency about every third - so over six years that gives three doublings of chip size and two of frequency, five doublings combined (using multicore). On the other hand, implementing an advanced GC takes up a lot of programming time.
I do think GC is relevant; I just don't think the effect of swapping on the time GC takes is that relevant. While we may only rarely need to use so much memory that we run out of RAM at a single moment, we may want to use ChucK in an installation that runs for a month. Right now if we want that we'd need to make very sure there is no memory leak, which is quite tricky in ChucK right now; I think function calls can leak, for example. CPU usage is a very different matter. Moore's law stopped affecting us much for ChucK, which doesn't multi-thread; actually clock speeds went down a bit. I find new laptops in stores right now less appealing than the ones that were there a few years ago when you could still get a non-wide screen, cardbus and a floppy drive. I may be somewhat unique in that but until I seriously need to run both realtime graphics and realtime sound on the same machine I see no real reason to upgrade to a multi-core at all. To me it's a bit like owning 10 toilets; you'll still only be able to use one at a time. Yours, Kas.
On 24 Jul 2009, at 03:19, Kassen wrote:
The GC question might be irrelevant, because music does not require so much power relative to graphics, and standard malloc/free is perhaps just a hundred times slower, or even less, than a fast GC. By Moore's law (which I checked on Macs), the chip size doubles every second year, and the CPU frequency about every third - so over six years that gives three doublings of chip size and two of frequency, five doublings combined (using multicore). On the other hand, implementing an advanced GC takes up a lot of programming time.
I do think GC is relevant; I just don't think the effect of swapping on the time GC takes is that relevant. While we may only rarely need to use so much memory that we run out of RAM at a single moment, we may want to use ChucK in an installation that runs for a month. Right now if we want that we'd need to make very sure there is no memory leak, which is quite tricky in ChucK right now; I think function calls can leak, for example.
Here I meant ordinary malloc/free cleanup as compared to a GC, which runs at its own times doing it; the latter is faster. The malloc() that comes with the C compiler typically takes several tens, sometimes hundreds, of cycles for each memory allocation - it has to cope with memory fragmentation. A GC can do that much faster, but with the collection-time problems. But if computers get powerful enough, that difference may not be of importance to the application at hand.
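A back-of-the-envelope way to see the gap being described - not a serious benchmark, and the numbers vary a lot by platform and allocator - is to time malloc/free pairs against bump allocations from a static arena:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N  1000000
    #define SZ 64

    static char arena[(size_t)N * SZ];

    int main(void)
    {
        unsigned long checksum = 0;

        clock_t t0 = clock();
        for (int i = 0; i < N; i++) {           /* general-purpose allocator */
            char *p = malloc(SZ);
            p[0] = (char)i;
            checksum += (unsigned char)p[0];
            free(p);
        }
        clock_t t1 = clock();

        char *bump = arena;
        for (int i = 0; i < N; i++) {           /* GC-style bump allocation  */
            char *p = bump;
            bump += SZ;                         /* allocation: a pointer add */
            p[0] = (char)i;
            checksum += (unsigned char)p[0];
        }
        clock_t t2 = clock();

        printf("malloc/free: %.0f us, bump: %.0f us (checksum %lu)\n",
               (t1 - t0) * 1e6 / CLOCKS_PER_SEC,
               (t2 - t1) * 1e6 / CLOCKS_PER_SEC, checksum);
        return 0;
    }

The bump loop of course never frees anything; that deferred cost is exactly the collection-time problem discussed above.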
CPU usage is a very different matter. Moore's law stopped affecting us much for ChucK, which doesn't multi-thread; actually clock speeds went down a bit.
That is a big problem: Chuck needs to be multi-threaded if it is to be able to make use of this new computing power.
I find new laptops in stores right now less appealing than the ones that were there a few years ago...
You can pick up a PowerBook at a good price at this point in time, but beware that the lids of the TiBooks will drop off after some time of use because the hinge structure is weak. But they can still be used, with an external display.
...when you could still get a non-wide screen, cardbus and a floppy drive.
I haven't seen a floppy drive for a long time, though they seem to have been longer in use on the PC side.
I may be somewhat unique in that but until I seriously need to run both realtime graphics and realtime sound on the same machine I see no real reason to upgrade to a multi-core at all. To me it's a bit like owning 10 toilets; you'll still only be able to use one at a time.
Multi-core can still be useful if one can run more than one copy of the program. Hans
Hans;
That is a big problem: Chuck needs to be multi-threaded if it is to be able to make use of this new computing power.
Yes. However, using multiple CPUs to distribute the work of a program like ChucK is a non-trivial matter. For one thing it's not always possible. Consider this:

5 => int n;
SinOsc sines[n];

for(int i; i < sines.cap(); i++)
{
    2 => sines[i].sync;
    1 + (10*i) => sines[i].gain;
    for(int j; j < sines.cap(); j++)
    {
        if(i != j) sines[i] => sines[j];
    }
}

sines[0] => dac;
5::second => now;

That is a highly interconnected UGen graph and I think we will agree that for any CPU there is a number "n" at which point this will no longer run on that CPU. At that point distributing the calculation will likely not help you much either, if it's possible at all. Most UGen graphs aren't like this; however, they may become like that due to new connections made. This means that we may have to re-optimise the whole graph, relative to the CPU cores, for any connections made. You can then try to optimise for either a minimum of dropouts or the most efficient use of the available power.... This is non-trivial. Our friends over in the SC scene didn't solve that one yet either.
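A crude C rendering of the dependency structure in that graph (not ChucK's actual UGen code; the frequencies and modulation are made up) shows why it is one tightly coupled n-by-n update per sample:

    #include <math.h>
    #include <stdio.h>

    #define N  5
    #define SR 44100.0

    int main(void)
    {
        double phase[N] = {0}, out[N] = {0}, in[N];

        for (int s = 0; s < 10; s++) {            /* a few samples           */
            for (int i = 0; i < N; i++) {         /* gather: every input     */
                in[i] = 0.0;                      /* needs every output      */
                for (int j = 0; j < N; j++)
                    if (i != j) in[i] += out[j];
            }
            for (int i = 0; i < N; i++) {         /* update all oscillators  */
                phase[i] += 220.0 * (1.0 + in[i]) / SR;
                out[i] = sin(2.0 * 3.14159265358979 * phase[i]);
            }
            printf("sample %d: %f\n", s, out[0]);
        }
        return 0;
    }

Splitting the oscillator loop over cores buys little here, because every oscillator has to see every other oscillator's previous output before the next sample can begin, so the cores would have to synchronize once per sample anyway.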
You can pick up a PowerBook at a good price at this point in time, but beware that the lids of the TiBooks will drop off after some time of use because the hinge structure is weak. But they can still be used, with an external display.
I personally don't feel PowerBooks suit my particular needs very well. Right now I think I could get more of a performance boost by getting into configuring a custom realtime Linux kernel on hardware I already own.
I haven't seen a floppy drive for a long time, though they seem to have been longer in use on the PC side.
I don't think that's a Mac/PC issue, at least it isn't for me. My interest in them has to do with liking the sound of old samplers and MIDI sample dump being atrociously slow. Floppies are small, prone to breaking and downright archaic... but very convenient for feeding samples to old samplers. In my dream world we could buy a soundcard with hardware plugins for various DA converters in order to get that type of sound.
Multi-core can still be useful if one can run more than one copy of the program.
Absolutely. Kas.
On 24 Jul 2009, at 17:59, Kassen wrote:
That is a big problem: Chuck needs to be multi-threaded if it is to be able to make use of this new computing power.
Yes. However, using multiple CPUs to distribute the work of a program like ChucK is a non-trivial matter. For one thing it's not always possible. Consider this:
5 => int n;
SinOsc sines[n];
for(int i; i < sines.cap(); i++)
{
    2 => sines[i].sync;
    1 + (10*i) => sines[i].gain;
    for(int j; j < sines.cap(); j++)
    {
        if(i != j) sines[i] => sines[j];
    }
}
sines[0] => dac;
5::second => now;
That is a highly interconnected UGen graph and I think we will agree that for any CPU there is a number "n" at which point this will no longer run on that CPU. At that point distributing the calculation will likely not help you much either, if it's possible at all.
Chuck threads, or "shreds", need only be synchronized every sample time. So intermediate calculations can be put on several system (POSIX) threads. So that is one possibility. So is that not possible in your example? If not, one may need to invent new syntax which is parallelizable. Specifically, highly serial "for" loops can probably not be done much about without such rewriting. A second possibility might be have several Chuck synchronizing threads, each synchronizing a set shreds, but between the sets there is a less rigid timing requirement, like a hundredth of second. This would be suitable to make different sound generators. Hans
Hans; Chuck threads, or "shreds", need only be synchronized every sample time. So
intermediate calculations can be put on several system (POSIX) threads. So that is one possibility.
If only it were so simple. Consider a variable and three shreds. Let all three shreds read the variable, apply some manipulation to it and write the result back, then advance time by "1.0/Std.rand2(3,8)::samp". Advancing time by fractions of a samp is perfectly valid; we have a temporal resolution that goes down to a smaller granularity than clock ticks of the host CPU. Evidently this is all purely imaginary and only used to determine calculation order to an extreme precision. Merely syncing every samp isn't precise enough; no matter how often you would sync, I will be able to write a ChucK program that will yield a different result on your multi-threaded system from what it currently yields on the single-threaded, strongly timed VM, at least until you start syncing at arbitrary moments in between CPU instructions. I'm fairly sure you won't be doing that. At the very least the moments you sync at should take the way shreds advance time into account. Another thing is that you do not yet have a solution for how to divide n+i shreds over n cores when it can't be determined ahead of time how much time any of them will take. It's a non-trivial issue, to say the least. The whole syntax of ChucK looks like it is meant to enable concurrency, but behind the scenes it is actually dealing with calculation order instead. Right now it simply attempts to do that to the best of its abilities, and when the CPU runs out, that's it. In a multi-threaded system you will need to determine exactly what "to the best of its abilities" means for the next period, and that's notoriously hard.
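What that strong timing amounts to can be seen in a toy C sketch of a virtual-time scheduler (not ChucK's VM; the random advance above is replaced by a fixed fraction per shred to keep it short): shreds are resumed strictly in order of their virtual time, which may be a fraction of a sample, and no wall-clock thread plays any part in the ordering.

    #include <stdio.h>

    #define NSHREDS 3

    typedef struct {
        double now;                    /* this shred's virtual time, in samples */
        int id;
    } Shred;

    static double shared = 0.0;        /* the variable all shreds manipulate */

    static void step(Shred *s)         /* one resume: read, modify, write,   */
    {                                  /* then advance virtual time          */
        shared = shared * 2.0 + s->id;
        printf("t = %.4f samp  shred %d  shared = %.1f\n", s->now, s->id, shared);
        s->now += 1.0 / (3 + s->id);   /* a fractional-sample advance        */
    }

    int main(void)
    {
        Shred shreds[NSHREDS] = { {0, 1}, {0, 2}, {0, 3} };
        for (int iter = 0; iter < 12; iter++) {
            int next = 0;                          /* always run the earliest  */
            for (int i = 1; i < NSHREDS; i++)      /* virtual time; ties break */
                if (shreds[i].now < shreds[next].now)   /* by shred order      */
                    next = i;
            step(&shreds[next]);
        }
        return 0;
    }

Because the order is decided purely by these numbers, the result is bit-for-bit reproducible; the moment real threads interleave the reads and writes of the shared variable, that guarantee is gone, which is the point being made above.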
So is that not possible in your example?
No. My example deals with the UGen graph. For the 5 seconds it runs no shreds are doing any work; it's meant to illustrate a UGen graph that won't run on a single core for a given number of SinOscs, yet that can't benefit from multiple cores either (as far as I know) because all UGens are interconnected. It's meant to show that multiple CPUs aren't always a substitute for clock speed. This is a well known phenomenon; I wrote the example specifically to cause that sort of problem.
If not, one may need to invent new syntax which is parallelizable. Specifically, highly serial "for" loops can probably not be done much about without such rewriting.
Yes, but not for all kinds of calculation. For some kinds of calculation a dual-core is nearly as fast as a single CPU at twice the speed; for others the second core won't help at all. Most ChucK programs (or at least the kind of calculation they describe) will be somewhere in between those two extremes.
A second possibility might be to have several Chuck synchronizing threads, each synchronizing a set of shreds, but between the sets there is a less rigid timing requirement, like a hundredth of a second. This would be suitable for making different sound generators.
Maybe.... I'm not saying we can't ever benefit from multiple CPU configurations at all, "merely" pointing out that there are very hard questions there. Yours, Kas.
On 24 Jul 2009, at 19:57, Kassen wrote:
Chuck threads, or "shreds", need only be synchronized every sample time. So intermediate calculations can be put on several system (POSIX) threads. So that is one possibility.
If only it were so simple. Consider a variable and three shreds. Let all three shreds read the variable, apply some manipulation to it and write the result back, then advance time by "1.0/Std.rand2(3,8)::samp". Advancing time by fractions of a samp is perfectly valid; we have a temporal resolution that goes down to a smaller granularity than clock ticks of the host CPU. Evidently this is all purely imaginary and only used to determine calculation order to an extreme precision.
If time isn't real, and only emulated, it might be parallelized, but may require special syntax for it. One concept is "causality": if at some computed time a decision is made as to how to proceed with new calculations, then that is not immediately parallelizable. Only things that flow on their own in such a causality diagram can be parallelized.
Merely syncing every samp isn't precise enough; no matter how often you would sync, I will be able to write a ChucK program that will yield a different result on your multi-threaded system from what it currently yields on the single-threaded, strongly timed VM, at least until you start syncing at arbitrary moments in between CPU instructions. I'm fairly sure you won't be doing that. At the very least the moments you sync at should take the way shreds advance time into account.
The idea is to treat time between samples as only computed, emulated, and not real. The program computes ahead, and collects the values for the next sample which is presented.
Another thing is that you do not yet have a solution for how to divide n+i shreds over n cores when it can't be determined ahead of time how much time any of them will take.
It's a non-trivial issue, to say the least. The whole syntax of ChucK looks like it is meant to enable concurrency, but behind the scenes it is actually dealing with calculation order instead. Right now it simply attempts to do that to the best of its abilities, and when the CPU runs out, that's it. In a multi-threaded system you will need to determine exactly what "to the best of its abilities" means for the next period, and that's notoriously hard.
So in this vein, I posted a POSIX-threaded sample illustrating how to reuse threads when searching files in a small 'grep'-like program. One can use any number of threads for any number of files. Each thread opens up a file and searches it. When one thread is finished, if there are more files to search, it continues to the next. So if this can be done with sample times as snapshots, all that is needed is sufficient CPU power to complete all computations before the next sample is presented.
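Not the actual posting, but a minimal sketch of the pattern described (the file names and search string are placeholders): a fixed set of POSIX threads keep claiming the next file until the list is exhausted.

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    static const char *files[]  = { "a.txt", "b.txt", "c.txt", "d.txt" };
    static const int   nfiles   = 4;
    static const char *needle   = "chuck";

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int next_file = 0;              /* index of the next unsearched file */

    static void *searcher(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            int i = next_file++;           /* claim the next file               */
            pthread_mutex_unlock(&lock);
            if (i >= nfiles)
                return NULL;               /* nothing left: the thread retires  */

            FILE *f = fopen(files[i], "r");
            if (!f) continue;
            char line[1024];
            while (fgets(line, sizeof line, f))
                if (strstr(line, needle))
                    printf("%s: %s", files[i], line);
            fclose(f);
        }
    }

    int main(void)
    {
        enum { NTHREADS = 2 };             /* any number of threads works */
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, searcher, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }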
So is that not possible in your example?
No. My example deals with the UGen graph. For the 5 seconds it runs no shreds are doing any work; it's meant to illustrate a UGen graph that won't run on a single core for a given number of SinOscs, yet that can't benefit from multiple cores either (as far as I know) because all UGens are interconnected. It's meant to show that multiple CPUs aren't always a substitute for clock speed. This is a well known phenomenon; I wrote the example specifically to cause that sort of problem.
Is this some kind of resource starvation setup? Like in: http://en.wikipedia.org/wiki/Dining_philosophers_problem http://en.wikipedia.org/wiki/Resource_starvation Then those things cannot be prevented by the design of the computer language, just as one cannot prevent non-termination (infinite loops) from being programmed. So one is back at finding better syntaxes that support what works. It may require new syntax. Not immediately related to this - I just happened to read a bit about it - Apple has launched something called "Grand Central Dispatch" that involves syntax extensions to C, C++ and Obj-C that allow one to identify blocks of parallelizable code. http://en.wikipedia.org/wiki/Grand_Central_Dispatch Hans
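For illustration, that GCD style looks roughly like this (a sketch in C with Apple's blocks extension; the per-index work is made up):

    #include <dispatch/dispatch.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        double *result = malloc(8 * sizeof *result);
        dispatch_queue_t q =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        /* the ^{...} block is the chunk of code marked as parallelizable;
           GCD spreads the 8 iterations over the available cores */
        dispatch_apply(8, q, ^(size_t i) {
            result[i] = i * 0.5;           /* independent per-index work */
        });

        for (int i = 0; i < 8; i++)
            printf("result[%d] = %f\n", i, result[i]);
        free(result);
        return 0;
    }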
On Fri, Jul 24, 2009 at 8:42 PM, Hans Aberg
The idea is to treat time between samples as only computed, emulated, and not real. The program computes ahead, and collects the values for the next sample which is presented.
Hi!
Consider this:

SinOsc osc => dac;

while (true)
{
    for (0 => int i; i < 10; i++)
    {
        i*1000 => osc.freq;
        1::samp => now;
    }
}

Stuff like this can become pretty difficult to try to compute ahead, since general ChucK code plays on the same timing playfield as the UGens. Also, consider if the for loop uses the current output value from osc in its calculation, causing a compute-ahead loop. Also, I think you can put the for loop in a different file than the oscillator it manipulates (with some added class magic). /Stefan
On 24 Jul 2009, at 21:15, Stefan Blixt wrote:
Hi!
Consider this:
SinOsc osc => dac;
while (true)
{
    for (0 => int i; i < 10; i++)
    {
        i*1000 => osc.freq;
        1::samp => now;
    }
}
Stuff like this can become pretty difficult to try to compute ahead, since general ChucK code plays on the same timing playfield as the UGens.
This is in fact interesting, because one can think of different models. In one, one wants the computations to be fully deterministic. This would be suitable for building a new sound generator: the details of the generator should be fully known, so that it can be computed ahead. Then those computations might in part be parallelized, in as much as they do not interdepend. In the other model, the sound generator is considered external to the changes made to it. This would be like a musician playing on a string, say. The musician has control over the string, but only partially - the string is a parallel object. So in this case, the oscillator can be put into a separate thread.
Also, consider if the for loop uses the current output value from osc in its calculation, causing a compute-ahead loop. Also, I think you can put the for loop in a different file than the oscillator it manipulates (with some added class magic).
So this would be the same, but more complicated. Hans
Hans; So in this vein, I posted a POSIX-threaded sample illustrating how to reuse
threads when searching files in a small 'grep'-like program. One can use any number of threads for any number of files. Each thread opens up a file and searches it. When one thread is finished, if there are more files to search, it continues to the next.
That example is set in an entirely different context. For one thing such searches aren't a realtime application; for another, the results of any one search don't depend on the outcome of any other and won't be affected by them. Such things can be parallelised very well. This is very different from a ChucK VM where any of the parts may interact with and depend on any number of other things.
So if this can be done with sample times as snapshots, all that is needed is sufficient CPU power to complete all computations before the next sample is presented.
You seem to be describing a situation where values travel through the UGen graph at a rate of one sample per UGen travelled through. That indeed is a scenario that can be parallelised well, but it's quite different from how it currently works, where the output of a single UGen will start affecting the value at the DAC in the very same sample, thanks to our "pull through" model. One of the big advantages of the model we are using now over - say - block processing is that we can have predictable, tuned feedback loops.
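The difference can be sketched in C (a caricature, not ChucK's engine): with per-sample evaluation a feedback loop is heard one sample later, while with block processing it cannot be shorter than one block.

    #include <stdio.h>

    #define BLOCK 64

    int main(void)
    {
        /* per-sample "pull": this sample's output feeds the very next sample */
        double fb = 0.0;
        for (int n = 0; n < 4; n++) {
            double x = (n == 0) ? 1.0 : 0.0;     /* an impulse              */
            double y = x + 0.5 * fb;             /* feedback, 1-sample loop */
            fb = y;
            printf("per-sample y[%d] = %f\n", n, y);
        }

        /* block processing: the fed-back value is only updated per block,
           so the loop delay grows to BLOCK samples */
        double in[BLOCK] = { 1.0 }, out[BLOCK];
        double fb_block = 0.0;
        for (int n = 0; n < BLOCK; n++)
            out[n] = in[n] + 0.5 * fb_block;     /* stale for the whole block */
        fb_block = out[BLOCK - 1];
        printf("block out[1] = %f (feedback not heard yet)\n", out[1]);
        printf("next block starts from fb = %f\n", fb_block);
        return 0;
    }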
Is this some kind of resource starvation setup? Like in: http://en.wikipedia.org/wiki/Dining_philosophers_problem http://en.wikipedia.org/wiki/Resource_starvation Then those things cannot be prevented by the design of the computer language, just as one cannot prevent non-termination (infinite loops) from being programmed.
Indeed. I meant it to illustrate that multi-cores are no real extension of Moore's law. Not everything can be parallelised. Another example would be calculating the first million numbers of the Fibonacci series; that won't be done sooner just by having more cores, because every next number depends on the last one. Yours, Kas.
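The shape of that dependency in C (94 numbers rather than a million, since unsigned 64-bit arithmetic overflows after fib(93), but the dependency is the same):

    #include <stdio.h>

    int main(void)
    {
        enum { N = 94 };
        unsigned long long fib[N];
        fib[0] = 0;
        fib[1] = 1;
        for (int i = 2; i < N; i++)         /* each value needs the previous two, */
            fib[i] = fib[i-1] + fib[i-2];   /* so the loop is inherently serial   */
        printf("fib(%d) = %llu\n", N - 1, fib[N - 1]);
        return 0;
    }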
On 24 Jul 2009, at 21:21, Kassen wrote:
So in this vein, I posted a POSIX-threaded sample illustrating how to reuse threads when searching files in a small 'grep'-like program. One can use any number of threads for any number of files. Each thread opens up a file and searches it. When one thread is finished, if there are more files to search, it continues to the next.
That example is set in an entirely different context. For one thing such searches aren't a realtime application; for another, the results of any one search don't depend on the outcome of any other and won't be affected by them. Such things can be parallelised very well.
This is very different from a ChucK VM where any of the parts may interact with and depend on any number of other things.
What I am saying is that I think this independence is the requirement for parallelization. So these parts must be identified in the Chuck code, which is now mostly written sequentially. This is a hard thing to do, of course. But then such independent parts can be computed in parallel or sequentially as needed, and the scheduling on different CPUs is not hard.
So if this can be done with sample times as snapshots, all that is needed is sufficient CPU power to complete all computations before the next sample is presented.
You seem to be describing a situation where values travel through the UGen graph at a rate of one sample per UGen travelled through.
The picture I have in my mind is a bit more complicated. The value at each sample time is a deadline that should be reported. Between those, one may have a complicated directed graph of events and computations between them. Arrows going in parallel can be parallelized, and computed by reusing ordinary threads. Arrows that meet or branch off must be met up with, but not in real time, except for the next sample-time reporting. And this is the best one can do.
That indeed is a scenario that can be parallelised well, but it's quite different from how it currently works, where the output of a single UGen will start affecting the value at the DAC in the very same sample, thanks to our "pull through" model. One of the big advantages of the model we are using now over - say - block processing is that we can have predictable, tuned feedback loops.
Perhaps the current Chuck model isn't well adapted for such a thing. Hans
On 22 Jul 2009, at 01:03, Hans Aberg wrote:
On 22 Jul 2009, at 00:06, Tom Lieber wrote: <snip>
<snip>I'm more interested in bringing ChucK ideas to other languages than the other way around right now. Other VMs have had a few more eyes and hands on them, and adding timing rules to them has been easier than adding language features to ChucK for me. I'm leaving that to the pros!
The problem is that these other GCs probably are not designed for real-time use, but to speed up programming. One needs something that can be used in real-time systems.
I side STRONGLY with Tom on this one. If the GC for language [X] isn't appropriate for real-time, then write a new GC and make it available to the world. That seems like a more fruitful use of time than, say, creating closures for ChucK. Just my own half a nibble... - Rob
participants (6)
- Hans Aberg
- Kassen
- Robert Poor
- Stefan Blixt
- Tom Duff
- Tom Lieber