From: apjedf@astro.psu.edu
Date: August 6, 2010 4:56:07 PM CDT
To: townsend@astro.wisc.edu
Cc: apjedf@astro.psu.edu
Subject: Your ApJ Submission MS#APJS80920

August 6, 2010   

Prof. Richard H. D. Townsend
University of Wisconsin at Madison
Astronomy Department
475 N. Charter Street
Madison, WI 53706


Title: Fast Calculation of the Lomb-Scargle Periodogram Using Graphics Processing Units

Dear Prof. Townsend,

I have received the referee's report on your above submission to The Astrophysical Journal, and appended it below. Both the referee and I think that your article is interesting and that it will merit publication once you have addressed the issues raised in the report.  I welcome the submission of a revised version; please include a detailed cover letter explaining the changes you've made to the text and your responses to the report.

Click the link below to upload your revised manuscript. The link will work only once.
<http://apj.msubmit.net/cgi-bin/main.plex?el=A7Ew1qZ6A4Cwi7F5A9sXwiSsr9FZnFl3xFZs8JVgZ>


Alternatively, you can also log into your account at the EJ Press web site, http://apj.msubmit.net.  Please use your user's login name: ritownsend.  If you have forgotten your password, you can ask for a new one via the Unknown/Forgotten Password link.

The Astrophysical Journal has adopted a new policy that manuscript files become inactive, and are considered to have been withdrawn, six months after the most recent referee's report goes to the authors, provided a revised version has not been received by that time.

If you have any questions, feel free to contact me.

Eric Feigelson, Scientific Editor
Astrophysical Journal
apjedf@astro.psu.edu
----------------------------------------------------------------------    
Referee Report   

This paper describing the Lomb-Scargle Periodogram, its refactoring for efficient parallel implementation, and its calculation on modern GPU hardware is concise and well-written.  It demonstrates that even moderately straightforward signal processing applications can benefit substantially by reimplementation on GPU hardware.  Nevertheless, it would be nice to see some more immediate motivation for, and application of, the work presented, but I imagine this will become apparent over the next year or two.  In light of its technical nature and possibly limited audience at this time though, I recommend it for publication in the Supplement Series of the Journal, rather than the main journal itself.

[Editor's comment:  For methodology papers, the Astrophysical Journal does require example(s) of application to astronomical datasets.  But in this case, where the LSP is used in >100 studies annually, this seems less necessary.  The author might add a paragraph to the introduction telling the less-expert reader how important LSP has been (and will be, e.g. for Pan-STARRS and LSST) giving some important application references.]


I do have two important comments that I would like to see addressed in a revised version of the paper:

1. I am concerned by the magnitude of the speed-up of the code executing on the Tesla system.  I don't necessarily doubt this achievement, but it is sufficiently outstanding that the author should explain how a speed-up of 200x is possible on a Tesla C1060 GPU compared to an E5345 Xeon CPU.  The raw floating-point performance of the C1060 compared to a single core of the E5345 is around 100x, and this assumes that a very specialised feature of the C1060 is being exploited, namely the ability to dual-issue multiply and multiply-add operations.  Where does the extra factor of >~2 come from?  Is it the choice to accept lower precision in the trigonometric functions?  If it is, can the same be done in the CPU implementation?  Is it due to memory bandwidth (unlikely given the author's assertion that their implementation is arithmetically intense)?  I think some discussion is needed in the paper to explain what's going on here.

2. Many authors are publishing papers which compare GPU re-implementations of code to existing CPU implementations, and almost without exception, massively parallel code on the newest GPU is compared to *serial* code on a late model CPU, even though the CPU(s) are frequently dual or quad-core CPUs.  Rather than discussing "price per core" in Section 6.1 as a way of handling this, I ask the author to apply some simple parallelisation to the CPU LSP code and actually measure its performance on the multi-CPU-core hardware at hand.  Given the refactoring of the LSP has already been done, the adaptation of the existing GPU kernel to the CPU should be easy using an extension such as OpenMP.  It is simply unfair to be claiming speed-ups of 200x when the CPU LSP code is using only 1/8th of the resources available to it by the author's design (why are you prepared to write parallel code for a GPU but not a CPU?).  Looking at price-per-core is still not quite right anyway: you cannot use a GPU without a host CPU!  So overall I think the fairest comparison that can and should be made is the comparison of:

(a) the performance of a single-core, entry-level machine ("base system") running the serial LSP
(b) the performance of a multi-core, server-level machine running a parallel LSP
(c) the performance of a GPU hosted in machine (a) or (b), running the parallel CULSP

These comparisons should be easy for the author to do and they will still demonstrate the value of the CULSP implementation.  Then you can examine what it costs to boost performance above and beyond the base system.  Looking at CPU per-core prices and GPU prices in isolation is best avoided in my view.  Furthermore, I think it is reasonable to expect the author to make comparisons using a uniform hardware and operating system choice and not introduce second and third hardware-software platform combinations as is done in Section 6.1.  The merit of comparing CPU performance over different hardware *and* operating system platforms is dubious and unscientific.  The author has presumably incurred no observing or travel expenses to do this work, so it seems reasonable to expect a small investment in hardware so that valid comparisons can be made.
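To make the request in comment 2 concrete, the following is a rough Python sketch of the frequency-partitioned parallelisation I have in mind. It is not the author's code and not the refactored LSP from the paper: it uses the textbook Lomb-Scargle form on mean-subtracted data, and the names lsp_at, lsp_chunk, lsp_parallel, n_chunks and map_fn are hypothetical, chosen only for illustration. Each frequency is independent of the others, so the frequency grid can simply be split into chunks and handed to separate workers.

```python
import math


def lsp_at(freq, t, x):
    """Textbook Lomb-Scargle power at one frequency (x assumed mean-subtracted)."""
    w = 2.0 * math.pi * freq
    # Time offset tau that makes the sine and cosine terms orthogonal.
    s2 = sum(math.sin(2.0 * w * ti) for ti in t)
    c2 = sum(math.cos(2.0 * w * ti) for ti in t)
    tau = math.atan2(s2, c2) / (2.0 * w)
    xc = xs = cc = ss = 0.0
    for ti, xi in zip(t, x):
        c = math.cos(w * (ti - tau))
        s = math.sin(w * (ti - tau))
        xc += xi * c
        xs += xi * s
        cc += c * c
        ss += s * s
    return 0.5 * (xc * xc / cc + xs * xs / ss)


def lsp_chunk(args):
    """Evaluate the periodogram for one chunk of frequencies."""
    freqs, t, x = args
    return [lsp_at(f, t, x) for f in freqs]


def lsp_parallel(freqs, t, x, n_chunks=8, map_fn=map):
    """Split the frequency grid into chunks and evaluate them via map_fn.

    map_fn defaults to the built-in (serial) map; an order-preserving
    parallel map can be supplied instead.
    """
    size = max(1, math.ceil(len(freqs) / n_chunks))
    chunks = [(freqs[i:i + size], t, x) for i in range(0, len(freqs), size)]
    power = []
    for part in map_fn(lsp_chunk, chunks):
        power.extend(part)
    return power
```

Passing the map method of a multiprocessing.Pool with 8 workers as map_fn would spread the chunks over 8 CPU cores. An OpenMP "parallel for" over the frequency loop in the author's C code would play the same role.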

A few minor comments follow:

3. A recent paper by Barsdell et al. (arXiv:1007.1660) identifies the "interact" algorithm as a fundamental element of astronomy problems such as direct N-body force calculation and gravitational microlensing calculations.  It seems to me that the refactored LSP also matches this algorithm, and the kernel itself (Fig. 1) is very reminiscent of a basic N-body kernel - the author may like to comment.

4. Fig. 4 - could this be one plot instead of two?  It would be easier to compare the LSP and CULSP results if they were on the same plot.  I don't think there's a space or overlap issue except maybe at low Nt.

5. In the commentary on the source code for the kernel, I think it would be helpful for the author to point out the meaning of the __syncthreads() and #pragma unroll lines of code.

6. Regarding Table 2, there is no indication whether the measured times are from a single execution or averaged over many executions.  This should be indicated.  Is precision to one-tenth of a second warranted?
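As an illustration of what comment 6 asks for, here is a minimal Python sketch of timing a calculation over repeated executions and reporting the mean together with the scatter. The names time_runs and workload are hypothetical, and workload is only a placeholder standing in for one periodogram calculation; the author's actual timing procedure is not known to me.

```python
import statistics
import time


def time_runs(func, n_runs=10):
    """Return (mean, sample standard deviation) of wall-clock times in seconds."""
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    # A standard deviation needs at least two samples.
    spread = statistics.stdev(samples) if n_runs > 1 else 0.0
    return statistics.mean(samples), spread


def workload():
    # Placeholder standing in for one periodogram calculation.
    return sum(i * i for i in range(100_000))


mean_s, std_s = time_runs(workload, n_runs=5)
print(f"{mean_s:.4f} s +/- {std_s:.4f} s over 5 runs")
```

The size of the scatter relative to the mean indicates how many digits of the tabulated times are meaningful.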

7. As far as I know, the main test system (Dell Precision 490) has only one PCI-e x16 slot.  It is unclear whether the author has both the C1060 and the 8400GS installed at the same time.  If so, the author should clarify which of the GPUs is installed in the x16 slot, and which is installed in a slower slot.  Transfers between the GPU and system memory will be faster in the x16 slot, and for many GPU applications this is important; the author could comment on whether this has any impact on execution times for the CULSP code.  I doubt it does given the small data volume of the test data used, but it may well be important for the applications proposed towards the end of the paper.

8. In Section 6.2 (Applications), the burden of computing many targets is briefly considered.  It should really be pointed out here that a server with 8 CPU cores might easily be able to run 8 instances of the serial LSP code simultaneously, and in such cases of coarse, embarrassingly data-parallel problems, the speed-up on the GPU of 200x is very much overstating the benefit of the GPU.
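The arithmetic behind comment 8 can be sketched in a few lines of Python. This is back-of-the-envelope only: the function name effective_speedup and the parallel_efficiency parameter are hypothetical, and the default assumes that 8 serial instances scale perfectly across 8 cores, which is an assumption on my part rather than a measurement.

```python
def effective_speedup(gpu_vs_one_core, n_cores, parallel_efficiency=1.0):
    """GPU speed-up relative to a server running one serial job per core.

    parallel_efficiency = 1.0 assumes the independent jobs scale perfectly.
    """
    if n_cores < 1 or not 0.0 < parallel_efficiency <= 1.0:
        raise ValueError("need n_cores >= 1 and 0 < parallel_efficiency <= 1")
    return gpu_vs_one_core / (n_cores * parallel_efficiency)


print(effective_speedup(200.0, 8))  # 25.0
```

Under that assumption, the quoted 200x against a single core corresponds to 25x against the full 8-core server.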


