Computing the Weighted Mean Distance to a Star Cluster

Astronomy, Astrophysics, and Cosmology are becoming fields ever more rooted in data and computation. Since, unlike other branches of physics, one cannot easily perform experiments to verify results, it is common to leverage data science and statistics as well as to build computational simulations.

In this bite, I'll use the CSV and DataFrames packages to read observational data of stars proposed to be in the Starfish Cluster. Each row contains a star's visual magnitude, logarithmic luminosity, and the probability that it belongs to the cluster. The goal is to determine the distance from Earth to the Starfish Cluster.

Combining the apparent-magnitude relation \(m = -2.5\log_{10}\Big(\frac{F}{F_0}\Big)\) with the flux \(F = \frac{L}{4\pi d^2}\), I'll compute the distances using the formula,

\[ \log_{10}(d) = \frac{1}{5}\left(m + 2.5\log_{10}\Big(\frac{L}{L_0}\Big) + 0.17\right) \,. \tag{1}\]
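
For completeness, here's a brief sketch of where the factor of \(\frac{1}{5}\) and the constant \(0.17\) come from, assuming \(L_0\) is the solar luminosity and taking the Sun's absolute visual magnitude to be \(M_\odot \approx 4.83\),

\[ m - M = 5\log_{10}(d) - 5 \,, \qquad M = M_\odot - 2.5\log_{10}\Big(\frac{L}{L_0}\Big) \,,\]

so that \(5\log_{10}(d) = m + 2.5\log_{10}\big(\frac{L}{L_0}\big) + 5 - M_\odot \approx m + 2.5\log_{10}\big(\frac{L}{L_0}\big) + 0.17\); dividing through by \(5\) gives equation \((1)\).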

This gives the distance in units of parsecs. For objects in our own solar system, we often use the Astronomical Unit (AU), the average distance from Earth to our sun, Sol. For objects in our galaxy, the Milky Way, we often use the Light Year (ly), which is approximately \(63,241.1\) AU. For objects in other galaxies or star groups, we generally use the Parsec (pc), which is approximately \(3.26\) ly. In Cosmology, since most objects we work with are so astronomically far away, we generally use the Megaparsec (Mpc), which is one million Parsecs or \(3.26\) million light years. That's really far!
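
As a quick illustration of those conversions in Julia (the constants are just the approximate factors quoted above, and the helper names are my own):

# approximate conversion factors from above (illustrative only)
const AU_PER_LY = 63_241.1    # astronomical units per light year
const LY_PER_PC = 3.26        # light years per parsec

pc_to_ly(d_pc) = d_pc * LY_PER_PC               # parsecs -> light years
pc_to_au(d_pc) = pc_to_ly(d_pc) * AU_PER_LY     # parsecs -> astronomical units

pc_to_ly(1.0e6)    # one Megaparsec is roughly 3.26 million light years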

I'll work with two sets: one containing all of the stars, and one where I've filtered out the low-probability stars. For each set I'll compute the probability-weighted mean distance and compare the two results.
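
For reference, the weighted mean here is the standard probability-weighted average, with each star's membership probability \(p_k\) serving as the weight for its distance \(d_k\),

\[ \bar{d} = \frac{\sum_k p_k\, d_k}{\sum_k p_k} \,.\]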

First I'll compute the distances "by hand," manually iterating over each row rather than using the column-wise functions provided by the DataFrames or StatsBase packages.

using CSV, DataFrames, StatsBase, Statistics
using LinearAlgebra: dot

const datapath = "/local/data/"
starfish_data = joinpath(datapath, "Starfish_Data.csv")

df = CSV.read(starfish_data, DataFrame);
# exclude low-probability stars
high_prob_stars = filter(row -> row.Prob > 70, df);

(rows, _) = size(df)
# all starfish star distances
all_starfish_distances = zeros(rows)
for (k, row) in enumerate(eachrow(df))
    vmag, logl, prob = row
    # equation (1): log10(d) = (m + 2.5*log10(L/L0) + 0.17) / 5, with d in parsecs
    all_starfish_distances[k] = 10^(1/5 * (vmag + 2.5*logl + 0.17))
end

# probability-weighted mean over all stars; the Prob column is in percent,
# and the factor of 100 cancels in the ratio
all_wmean = dot(all_starfish_distances, df.Prob/100) / sum(df.Prob/100);
@show round(all_wmean; digits=4);
> round(all_wmean; digits = 4) = 1380.9686

(hp_rows, _) = size(high_prob_stars)
# high probability starfish star distances
hp_starfish_distances = zeros(hp_rows)
for (k, row) in enumerate(eachrow(high_prob_stars))
    vmag, logl, prob = row
    hp_starfish_distances[k] = 10^(1/5 * (vmag + 2.5*logl + 0.17))
end

# probability-weighted mean over the high-probability stars
hp_wmean = dot(hp_starfish_distances, high_prob_stars.Prob/100) / sum(high_prob_stars.Prob/100);
@show round(hp_wmean; digits=4);
> round(hp_wmean; digits = 4) = 1384.9069

Checking against the known distance to the Starfish Cluster, the estimate of about \(1380\) parsecs looks quite accurate.

When I wrap each computation in a call to the @time macro, we can see how long each takes.
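
A minimal sketch of that wrapping, re-using the arrays from above (the exact grouping I used may have differed slightly):

# wrap each weighted-mean computation in @time to report elapsed time and allocations
@time begin
    all_wmean = dot(all_starfish_distances, df.Prob/100) / sum(df.Prob/100)
    @show round(all_wmean; digits=4)
end;

@time begin
    hp_wmean = dot(hp_starfish_distances, high_prob_stars.Prob/100) / sum(high_prob_stars.Prob/100)
    @show round(hp_wmean; digits=4)
end;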

> round(all_wmean; digits = 4) = 1380.9686
>   0.000569 seconds (2.41 k allocations: 108.484 KiB)

> round(hp_wmean; digits = 4) = 1384.9069
>   0.074162 seconds (305.37 k allocations: 16.432 MiB, 98.81% compilation time)

I'll also compute it using functionality from the DataFrames and StatsBase packages. Specifically, I'll write a function weighted_mean that takes a DataFrame as input and returns the weighted mean distance rounded to four digits. The function uses the combine function from the DataFrames package and the fweights function from StatsBase, which builds the weight vector for the weighted mean.

In the function, I first construct the weight vector, since it depends only on the Prob column and doesn't need to be rebuilt inside the transformation. Then combine applies equation \((1)\) to every record and takes the weighted mean of the resulting distances. It is clearly much more compact and clean than the manual version.

using CSV, DataFrames, StatsBase, Statistics

const datapath = "/local/data/"
starfish_data = joinpath(datapath, "Starfish_Data.csv")

function weighted_mean(df)
    # weights from the membership probabilities; fixed for a given DataFrame
    df_fweights = fweights(df.Prob)
    df_wm = combine(df, 
        ["#Vmag", "logL", "Prob"] => ((vmag, logl, prob) -> begin
            # equation (1) applied column-wise, then the weighted mean of the distances
            mean(10 .^ (1/5 * (vmag .+ 2.5*logl .+ 0.17)), df_fweights)
        end) => :weighted_mean)
    round(first(df_wm.weighted_mean); digits=4)
end

df = CSV.read(starfish_data, DataFrame);
@show weighted_mean(df);
> weighted_mean(df) = 1380.9686

# exclude low-probability stars
high_prob_stars = filter(row -> row.Prob > 70, df);
@show weighted_mean(high_prob_stars);
> weighted_mean(high_prob_stars) = 1384.9069

Checking the time requirement again, I can see that the first computation actually takes quite a bit longer than the manual approach; most of that time is spent compiling the anonymous function passed to combine, and combine itself carries a bit more overhead than a plain dot product. This is a good reminder of why it makes sense to benchmark large computations rather than assume the library route is faster. In this case I'm only processing around one hundred records, so the clarity of the second approach makes it worthwhile here, but for a much larger data set it would be appropriate to use the first approach exclusively.

> weighted_mean(df) = 1380.9686
>   0.217225 seconds (459.91 k allocations: 27.381 MiB, 3.73% gc time, 99.60% compilation time)
> weighted_mean(high_prob_stars) = 1384.9069
>   0.070481 seconds (303.75 k allocations: 16.364 MiB, 99.41% compilation time)
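
Since nearly all of that first timing is compilation of the anonymous function handed to combine, a fairer comparison would measure the already-compiled code, for example with the BenchmarkTools package (not otherwise used in this bite); a minimal sketch:

using BenchmarkTools

# @btime runs the call repeatedly after compilation and reports the minimum time;
# interpolating with $ keeps global-variable lookup out of the measurement
@btime weighted_mean($df);
@btime weighted_mean($high_prob_stars);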