The S@h hibernation - long overdue IMHO - will free up some of Eric's time
and hopefully let us finish this project before the sun goes red giant.
Recent changes - described in my previous blog entries - have focused on finding and scoring multiplets. I think we're done with big changes in those areas. Now we move on to the last stage: deciding what scientific conclusions we want to make, and figuring out how to make them.
Finding ET would be a scientific conclusion. Failing that, we want to make (and quantitatively support) a statement about the "sensitivity" of our search - a statement of the form "if there were a radio beacon in frequency range X, in part Y of the sky, with power at least Z, our search would have detected it with probability P". It's hard to prove such a statement, but the "birdie" mechanism we've created in Nebula gives us a tool for doing so.
This is complicated by the fact that S@h is looking for lots of kinds of signals, and we're more sensitive to some than others. We’re looking for signals in an abstract "space" with several dimensions:
- Frequency variation (up to 250 Hz for barycentric signals, up to 200 kHz for non-bary).
- Pulsed or not, and pulse parameters (period and duty cycle).
- Observation time for that sky position (i.e. pixel).
Those are the main ones. Others:
- Range of chirp rate (due to parameters of the sender’s planet).
- Amount of observing time with slew rate in the Gaussian range.
- Intrinsic bandwidth of the signal.
It’s likely that our sensitivity varies widely in different areas of this space. Our goal is to estimate the sensitivity in these areas. Doing this serves several purposes:
- Assuming we don’t find ET, sensitivity estimates are our main scientific result.
- It can guide our algorithm development: e.g. if our sensitivity to signals with big frequency variation is poor, we could try to improve the non-bary multiplet-finding algorithm.
- It can guide future radio SETI sky surveys by suggesting optimal scanning parameters.
The basic method for estimating sensitivity in an area A of the search space is:
- Generate a bunch of birdies in A, with a range of powers.
- See which of these birdies get “detected”, i.e. produce a multiplet whose score ranks in the top 1000 of non-birdie multiplets. (We assume that we'll manually examine the top 1000 multiplets of each type, and that we'll be able to decide if one of them is ET.)
- Find the power P for which most of the birdies of power P or greater are detected.
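The "detected" test in the second step can be sketched roughly as follows. This is only an illustration, not Nebula code: the function names `birdie_rank` and `is_detected` are made up, and it assumes that higher scores are better and that a tie with a non-birdie multiplet goes in the birdie's favor.

```python
from bisect import bisect_right


def birdie_rank(non_birdie_scores, birdie_score):
    """Rank a birdie multiplet would have among non-birdie multiplets (1 = best).

    Assumption: a higher score is better. Ties are resolved in the birdie's favor.
    """
    ordered = sorted(non_birdie_scores)  # ascending order
    # Count the non-birdie multiplets that score strictly higher than the birdie.
    higher = len(ordered) - bisect_right(ordered, birdie_score)
    return higher + 1


def is_detected(non_birdie_scores, birdie_score, top_n=1000):
    """True if the birdie multiplet would land in the top `top_n` non-birdie multiplets."""
    return birdie_rank(non_birdie_scores, birdie_score) <= top_n


if __name__ == "__main__":
    scores = [9.5, 7.2, 7.2, 3.1, 0.4]       # made-up non-birdie scores
    print(birdie_rank(scores, 8.0))          # one score (9.5) is higher, so rank 2
    print(is_detected(scores, 3.0, top_n=3)) # four scores are higher, so rank 5: False
```

The non-birdie scores would come from the birdie-free Nebula run, so the list only needs to be sorted once per signal class in a real implementation.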
Specifics are given below. First, some general notes:
- We can’t study sensitivity to pulsed signals since we haven't implemented pulsed birdies. Future work.
- We can’t study factors specific to Gaussians (e.g. amount of time observed at Gaussian slew rates) since we currently don’t make Gaussians for birdies. Future work.
- Our handling of bary and non-bary signals is somewhat different, and the multiplet scores aren’t necessarily comparable. So we’ll handle the two signal classes separately.
- To get an accurate list of non-birdie multiplets, we need to do a complete Nebula run (RFI removal and scoring) without birdies. The presence of birdies could mask high-scoring non-birdie multiplets. Of course, birdies could mask each other; if this looks like an issue we can generate birdies so that they don’t overlap.
- We may want to estimate sensitivity where one parameter is limited and others are not. What should be the distribution of the “free” parameters? In the case of observation time, it’s the actual distribution of our observations. For planetary parameters, we may as well use the distribution that Eric defined for generating birdies, based on the statistics of stars and observed exoplanets. Same for intrinsic bandwidth.
Now let’s return to the question of exactly how to estimate sensitivity for an area A (say, barycentric signals in pixels observed less than 10 minutes). Suppose we’ve generated a set B of birdies in A. For each birdie b we have the pair
(power(b), rank(b))
where rank(b) is the rank of the highest-scoring multiplet containing signals in b, or +inf if there is none. Think of the scatter plot of these points. Ideally, rank generally decreases as power increases, and beyond some power most of the ranks are under 1000.
For a given power p, define F(p) as
F(p) = (# birdies with power >= p and rank < 1000) / (# birdies with power >= p)
F is the fraction of birdies of power at least p which we detected. It’s piecewise constant. Ideally we’d like it to be monotonically increasing and asymptote to 1, but in practice neither of these is necessarily true.
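Here's a minimal sketch of computing F(p) from a list of (power, rank) pairs. The function name `detected_fraction` and the sample data are invented for illustration. It reads "power at least p" as power >= p and uses rank < 1000 as the detection cutoff, and it returns None when no birdie has power >= p, since the fraction is undefined there.

```python
import math


def detected_fraction(birdies, p, rank_limit=1000):
    """F(p): fraction of birdies with power >= p whose rank is below rank_limit.

    `birdies` is a list of (power, rank) pairs; rank is math.inf when the
    birdie produced no multiplet. Returns None if no birdie has power >= p.
    """
    ranks = [rank for power, rank in birdies if power >= p]
    if not ranks:
        return None
    detected = sum(1 for rank in ranks if rank < rank_limit)
    return detected / len(ranks)


if __name__ == "__main__":
    # Made-up birdies: (power, rank of best multiplet containing the birdie).
    birdies = [(1.0, math.inf), (2.0, 5000), (3.0, 800),
               (4.0, 1200), (5.0, 40), (6.0, 3)]
    for p in (1.0, 3.0, 4.5):
        print(p, detected_fraction(birdies, p))  # 0.5, then 0.75, then 1.0
```

Note that in this toy data F goes 0.75 at p = 3 and about 0.67 at p = 4, so it isn't monotonic, which is the kind of behavior mentioned above.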
Now pick a number 0 < C < 1. C is our target probability of finding ET. Let’s say that C = 0.5.
Define sensitivity(A) as the least p0 such that F(p) > C for all p > p0. In other words, if there’s a signal in A with power at least p0, the probability that we’ll find it is at least C.
There may be no such p0, in which case our search is not sensitive within A. This means that no matter how powerful a signal is, our chance of finding it doesn’t go above the threshold C.
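A rough sketch of this definition, under some assumptions: F only changes at the birdie powers, so the code evaluates it only there and returns the smallest birdie power from which F stays above C all the way up to the strongest birdie. That is a slightly discretized reading of the definition. The names `sensitivity` and `detected_fraction` are invented, and None stands for "not sensitive within A".

```python
import math


def detected_fraction(birdies, p, rank_limit=1000):
    """F(p): fraction of birdies with power >= p and rank < rank_limit."""
    ranks = [rank for power, rank in birdies if power >= p]
    if not ranks:
        return None
    return sum(1 for rank in ranks if rank < rank_limit) / len(ranks)


def sensitivity(birdies, c=0.5, rank_limit=1000):
    """Smallest birdie power q with F(q') > c for every birdie power q' >= q.

    Returns None if F is not above c even at the highest birdie power.
    Quadratic in the number of birdies, which is fine for a sketch.
    """
    powers = sorted({power for power, _ in birdies})
    p0 = None
    # Walk down from the strongest birdie; stop where F first drops to c or below.
    for q in reversed(powers):
        if detected_fraction(birdies, q, rank_limit) > c:
            p0 = q
        else:
            break
    return p0


if __name__ == "__main__":
    birdies = [(1.0, math.inf), (2.0, 5000), (3.0, 800),
               (4.0, 1200), (5.0, 40), (6.0, 3)]
    print(sensitivity(birdies, c=0.5))  # 2.0: F(1.0) is exactly 0.5, not above it
    print(sensitivity(birdies, c=0.7))  # 5.0: F(4.0) is about 0.67
    print(sensitivity([(1.0, 10), (2.0, math.inf)]))  # None: strongest birdie missed
```

One caveat: near the top of the power range F is computed from very few birdies, so the result can hinge on one or two of them.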
How do we generate birdies so that F(p) is statistically significant? How many birdies do we need, and what powers? I don’t currently have any concrete ideas. We could use input from a statistician. General suggestions:
- List the areas A for which we want to estimate sensitivity.
- Do a scoring run with some population of birdies.
- For each A, eyeball the scatter plot.
- If there aren’t enough birdies in A (say 50 or 100), add more.
- If there aren’t enough birdies with rank < 1000, add more high-power birdies in A.
- Get a rough idea of where sensitivity(A) is. Make sure there are a number of birdies with powers well below and well above this point.
But these are just heuristics. We should find a way to quantify the error in our estimates.
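One possible starting point, offered as a suggestion rather than a decision: treat each birdie with power >= p as an independent detect/miss trial and put a Wilson score interval around F(p). That's a standard binomial confidence interval, not something we've adopted. The function name `wilson_interval` is just for illustration, and z = 1.96 corresponds to roughly 95% confidence.

```python
import math


def wilson_interval(detected, total, z=1.96):
    """Wilson score interval for a binomial proportion detected / total.

    Assumption: each birdie is an independent trial with the same detection
    probability. Returns (low, high), clipped to [0, 1].
    """
    if total <= 0:
        raise ValueError("need at least one birdie")
    phat = detected / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (phat + z2 / (2 * total)) / denom
    half = z * math.sqrt(phat * (1 - phat) / total + z2 / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # E.g. 30 of the 50 birdies with power >= p were detected.
    low, high = wilson_interval(30, 50)
    print(f"F(p) = 0.60, interval {low:.2f} to {high:.2f}")
```

This only bounds F at a single p. The birdie sets for different p overlap, so the intervals aren't independent, and it doesn't directly give an error bar on sensitivity(A) itself. That part still needs a statistician.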
What areas of signal space should we study? Here’s a proposal:
1. Pixels with < 1 minute observation
2. Pixels with 1 - 10 minutes
3. Pixels with > 10 minutes (replace 1, 10 min with the terciles of actual obs times)
4. Signals with intrinsic BW < 1 Hz (optional)
5. Signals with intrinsic BW > 10 Hz (optional)

1. All pixels, all freq variations
2. Freq variation < 20 kHz
3. Freq variation 20 kHz - 100 kHz
4. Freq variation > 100 kHz
5. Items 2-4, in pixels with < 1 min observation
6. Items 2-4, in pixels with 1 - 10 min
7. Items 2-4, in pixels with > 10 min
That’s a total of 19 areas, and it covers the important dimensions.
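Replacing the 1 and 10 minute cutoffs with terciles could look something like the sketch below. It uses `statistics.quantiles` from the Python standard library (3.8+). The function names, the bin labels and the sample times are all invented, and which bin a pixel sitting exactly on a cut point falls into is an arbitrary choice here.

```python
from statistics import quantiles


def tercile_cuts(obs_minutes):
    """Two cut points that split the observation times into three equal-sized groups."""
    low_cut, high_cut = quantiles(obs_minutes, n=3)
    return low_cut, high_cut


def obs_time_bin(t, cuts):
    """Label an observation time as 'short', 'medium' or 'long' given the cut points."""
    low_cut, high_cut = cuts
    if t <= low_cut:
        return "short"
    if t <= high_cut:
        return "medium"
    return "long"


if __name__ == "__main__":
    # Made-up per-pixel observation times, in minutes.
    times = [0.2, 0.5, 0.9, 1.5, 3.0, 6.0, 12.0, 25.0, 40.0]
    cuts = tercile_cuts(times)
    for t in times:
        print(t, obs_time_bin(t, cuts))
```

With tercile cuts, each observation-time area gets about the same number of pixels, which should make it easier to put comparable numbers of birdies in each.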
If we can show our sensitivity in each of these areas,
that will make a nice paper; I'll be happy with that.