I thank the referee for their constructive remarks. I have revised the
manuscript to address each of their criticisms --- specific responses
are detailed below. Areas of the new manuscript that have seen
significant changes are typeset in bold.

> This paper describing the Lomb-Scargle Periodogram, its refactoring
> for efficient parallel implementation, and its calculation on modern
> GPU hardware is concise and well-written.  It demonstrates that even
> moderately straightforward signal processing applications can benefit
> substantially by reimplementation on GPU hardware.  Nevertheless, it
> would be nice to see some more immediate motivation for, and
> application of, the work presented, but I imagine this will become
> apparent over the next year or two.  In light of its technical nature
> and possibly limited audience at this time though, I recommend it for
> publication in the Supplement Series of the Journal, rather than the
> main journal itself.

I am happy for the paper to appear in ApJS.

> [Editor's comment:  For methodology papers, the Astrophysical Journal
> does require example(s) of application to astronomical datasets.  But
> in this case, where the LSP is used in >100 studies annually, this
> seems less necessary.  The author might add a paragraph to the
> introduction telling the less-expert reader how important LSP has been
> (and will be, e.g. for Pan-STARRS and LSST) giving some important
> application references.]

I have added remarks in the opening paragraph, and in Section 6.3, to
stress the importance of LSP.

> 1. I am concerned by the magnitude of the speed-up of the code
> executing on the Tesla system.  I don't necessarily doubt this
> achievement, but it is sufficiently outstanding that the author should
> explain how a speed-up of 200x is possible on a Tesla C1060 GPU
> compared to a E5345 Xeon CPU.  The raw floating-point performance of
> the C1060 compared to a single core of the E5345 is around 100x, and
> this assumes that a very specialised feature of the C1060 is being
> exploited, namely the ability to dual-issue multiply and multiply-add
> operations.  Where does the extra factor of >~2 come from?  Is it the
> choice to accept lower precision in the trigonometric functions?  If
> it is, can the same be done in the CPU implementation?  Is it due to
> memory bandwidth (unlikely given the author's assertion that their
> implementation is arithmetically intense)?  I think some discussion is
> needed in the paper to explain what's going on here.

I have expanded Section 5.2 considerably to explain the performance
gap between CPU and GPU. (It is due to a combination of faster trig
functions and overlapping arithmetic/trig calculations.)

> 2. Many authors are publishing papers which compare GPU
> re-implementations of code to existing CPU implementations, and almost
> without exception, massively parallel code on the newest GPU is
> compared to *serial* code on a late model CPU, even though the CPU(s)
> are frequently dual or quad-core CPUs.  Rather than discussing "price
> per core" in Section 6.1 as a way of handling this, I ask the author
> to apply some simple parallelisation to the CPU LSP code and actually
> measure its performance on the multi-CPU-core hardware at hand.  Given
> the refactoring of the LSP has already been done, the adaptation of
> the existing GPU kernel to the CPU should be easy using an extension
> such as OpenMP.  It is simply unfair to be claiming speed-ups of 200x
> when the CPU LSP code is using only 1/8th of the resources available
> to it by the author's design (why are you prepared to write parallel
> code for a GPU but not a CPU?).  Looking at price-per-core is still
> not quite right anyway: you cannot use a GPU without a host CPU!  So
> overall I think the fairest comparison that can and should be made is
> the comparison of:
>
> (a) the performance of a single-core, entry-level machine ("base system") running the serial LSP
> (b) the performance of a multi-core, server-level machine running a parallel LSP
> (c) the performance of a GPU hosted in machine (a) or (b), running the parallel CULSP
>
> These comparisons should be easy for the author to do and they will
> still demonstrate the value of the CULSP implementation.  Then you can
> examine what it costs to boost performance above and beyond the base
> system.  Looking at CPU per-core prices and GPU prices in isolation is
> best avoided in my view.  Furthermore, I think it is reasonable to
> expect the author to make comparisons using a uniform hardware and
> operating system choice and not introduce second and third
> hardware-software platform combinations as is done in Section 6.1.
> The merit of comparing CPU performance over different hardware *and*
> operating system platforms is dubious and unscientific.  The author
> has presumably incurred no observing or travel expenses to do this
> work, so it seems reasonable to expect a small investment in hardware
> so that valid comparisons can be made.

These very valid points led me to significantly rework Sections 5 and
6.1 of the manuscript. The (CPU) LSP code has been parallelized using
OpenMP. A unified test platform --- the Dell Precision 490, with two
Xeon E5345 CPUs and the two GPUs --- is used for all of the
benchmarking. Comparisons are made between CULSP on each GPU, LSP
running with one OpenMP thread (equivalent to a serial code) and LSP
running with eight OpenMP threads. The GPUs still retain a clear edge
over the CPUs, but obviously the disparity is not as stark as before.

> 3. A recent paper by Barsdell et al. (arXiv:1007.1660) identifies the
> "interact" algorithm as a fundamental element of astronomy problems
> such as direct N-body force calculation and gravitational microlensing
> calculations.  It seems to me that the refactored LSP also matches
> this algorithm, and the kernel itself (Fig. 1) is very reminiscent of
> a basic N-body kernel - the author may like to comment.

A very interesting paper -- I wasn't aware of it. I've added text and
a new equation (now numbered 4) to Section 3, drawing attention to the
Barsdell study.

> 4. Fig. 4 - could this be one plot instead of two?  It would be easier
> to compare the LSP and CULSP results if they were on the same plot.  I
> don't think there's a space or overlap issue except maybe at low Nt.

Done.

> 5. In the commentary on the source code for the kernel, I think it
> would be helpful for the author to point out the meaning of the
> __syncthreads() and #pragma unroll lines of code.

Done (Section 4.4).

> 6. Regarding Table 2, there is no indication if the measured times are
> from a single execution, or averaged over many executions.  This
> should be indicated.  Is precision to one-tenth of a second
> warranted?

I've reworked the table to give mean times from five executions. The
table also indicates the associated standard deviations. The beginning
of Section 5.3 discusses this.

> 7. As far as I know, the main test system (Dell Precision 490) has
> only one PCI-e x16 slot.  It is unclear whether the author has both
> the C1060 and the 8400GS installed at the same time.  If so, the
> author should clarify which of the GPUs is installed in the x16 slot,
> and which is installed in a slower slot.  Transfers between the GPU
> and system memory will be faster in the x16 slot, and for many GPU
> applications this is important; the author could comment on whether
> this has any impact on execution times for the CULSP code.  I doubt it
> does given the small data volume of the test data used, but it may
> well be important for the applications proposed towards the end of the
> paper.

Done; Section 5.1 clarifies the slot arrangement, and Section 5.3
discusses the impact on execution times.

> 8. In Section 6.2 (Applications), the burden of computing many targets
> is briefly considered.  It should really be pointed out here that a
> server with 8 CPU cores might easily be able to run 8 instances of the
> serial LSP code simultaneously, and in such cases of coarse,
> embarrassingly data-parallel problems, the speed-up on the GPU of 200x
> is very much overstating the benefit of the GPU.

This is no longer necessary, since LSP now uses OpenMP and can run on
all eight cores.

Other changes I have made are as follows:

*) Typo corrections & rewording for clarification throughout.

*) Made minor modifications to the source code (leading to slightly
improved GPU performance and accuracy).

*) Deleted the small summary section, since it added nothing that
wasn't already articulated in the abstract and discussion section.
