aelab

Introduction

The goal of achieving net zero emissions by 2050 is a significant driver of research focused on understanding greenhouse gas (GHG) flux dynamics and the autotrophic production of ecosystems. This ambitious target highlights the urgent need to explore how different ecosystems contribute to GHG emissions and how they can be managed to reduce these emissions effectively. To support this research, this R package offers a variety of useful tools designed to facilitate the analysis of data related to GHG flux and production in aquatic environments.

This vignette is organized into four parts: GHG flux calculation, covering static-chamber flux measurements from raw analyser output through to a final flux estimate; NEM calculation, covering dissolved oxygen (DO) logger data and the resulting gross primary production (GPP), ecosystem respiration (ER), and net ecosystem metabolism (NEM); Statistical analysis, a set of helpers for descriptive statistics, outlier detection, normality testing, and group comparisons; and Visualization, plotting wrappers and colour palettes built on a shared theme.

GHG flux calculation

Various techniques are used to measure greenhouse gas (GHG) flux, including eddy covariance (EC), the boundary layer method (BLM), and the static chamber method. This study provides R functions specifically designed to analyze data from GHG flux measurements using the static chamber method with a portable gas analyzer.

Load and process the raw data file — tidy_ghg_analyzer()

First, we need to obtain necessary parameters from the raw data downloaded from the portable gas analyzer. We will remove unnecessary rows and columns and extract the GHG concentrations along with the corresponding date and time recorded by the analyzer using tidy_ghg_analyzer(). Additionally, this function will also eliminate any NaN values in the GHG data. Currently, this function supports raw data downloaded from:
(i) LI-COR Trace Gas Analyzer (LI-7810 and LI-7820). (ii) ABB LGR-ICOS Gas Analyzer (M-GGA-918).

# The provided file is a raw data file downloaded from 
# the LI-COR Trace Gas Analyzer
ghg_data_path <- system.file("extdata", "ch4.xlsx", package = "aelab", mustWork = T)
ch4 <- tidy_ghg_analyzer(ghg_data_path, "ch4")
ch4[c(1:3), ]
##         DATE     TIME      CO2      CH4           date_time
## 1 2023/03/11 07:31:59 799.9406 2999.952 2023-03-11 07:31:59
## 2 2023/03/11 07:32:00 770.1415 2995.596 2023-03-11 07:32:00
## 3 2023/03/11 07:32:01 771.6826 2993.581 2023-03-11 07:32:01

Sometimes, the date and time recorded by the analyzer do not match the actual date and time we recorded in situ. Therefore, we need to convert the date_time column from the previous step to align with the real-life time, if there are any discrepancies:

# The analyzer's time was assumed to be 
# 15 minutes and 30 seconds faster than real time
ch4 <- convert_time(ch4, min = -15, sec = 30)
ch4[c(1:3), c(5:6)]
##             date_time       real_datetime
## 1 2023-03-11 07:31:59 2023-03-11 07:17:29
## 2 2023-03-11 07:32:00 2023-03-11 07:17:30
## 3 2023-03-11 07:32:01 2023-03-11 07:17:31

Calculate the slope — calculate_regression()

We first need to obtain the slope from the linear regression of GHG concentration changes over time within the chamber to calculate the flux. This can be done by defining the measurement time interval from a reference file (see Method 1 below), directly in R (see Method 2 below), or from a compact inline string via make_reference() (see Method 3 below).

Method 1: Load from excel

In Excel, enter the date and time when the GHG flux measurement started, then load the file into R. Make sure the column containing the date and time information is formatted in ISO 8601 as YYYY-MM-DD hh:mm:ss.

ref_data_path <- system.file("extdata", "reference.xlsx", package = "aelab", mustWork = T)
ref <- read_excel(ref_data_path)
ref
## # A tibble: 3 × 5
##   real_date           real_time           date_time           site  analyzer
##   <dttm>              <dttm>              <dttm>              <chr> <chr>   
## 1 2023-03-11 00:00:00 1900-01-01 07:32:00 2023-03-11 07:32:00 S1    1337    
## 2 2023-03-11 00:00:00 1900-01-01 08:32:00 2023-03-11 08:32:00 S2    1337    
## 3 2023-03-11 00:00:00 1900-01-01 09:32:00 2023-03-11 09:32:00 S3    1337

Now, we can perform regressions on the GHG concentrations using calculate_regression(). The data frame passed to its reference_df argument — here, ref — must contain a date_time column (measurement start times), a site column, and an analyzer column (e.g. "1337").

Window type

Two window modes are available via window_type:

  • "sliding" (default): moves a fixed-length sub-window (default num_rows = 300 rows ≈ 5 min) across the full measurement period and keeps the position with the highest R². Useful when the exact measurement start is uncertain.
  • "fixed": fits a single window of duration_minutes starting at start_offset_s seconds after the reference time. More reproducible — all gases are evaluated over the same interval. The first 30 s of chamber deployment — the “dead-bar period” — are mechanically disturbed; skipping them with start_offset_s = 30 is recommended.
# Sliding window (default): best 5-min sub-window within 7-min period
calculate_regression(ch4, ref, ghg = "CH4", reference_time = "date_time",
                     site = "site", analyzer_code = "1337",
                     duration_minutes = 7, num_rows = 300)
## # A tibble: 3 × 12
##   start_time     end_time   slope r_square correlation b_param fit_used g_factor
##   <chr>          <chr>      <dbl>    <dbl>       <dbl>   <dbl> <chr>       <dbl>
## 1 2023/03/11 07… 2023/03… -0.0406    0.671      -0.819      NA linear         NA
## 2 2023/03/11 08… 2023/03… -0.207     0.671      -0.819      NA linear         NA
## 3 2023/03/11 09… 2023/03…  0.104     0.444       0.666      NA linear         NA
## # ℹ 4 more variables: flag <chr>, reference_time <dttm>, site <chr>,
## #   analyzer <chr>
# Fixed window: regress from t+30 s to t+7 min (skips dead-bar period)
calculate_regression(ch4, ref, ghg = "CH4", reference_time = "date_time",
                     site = "site", analyzer_code = "1337",
                     window_type = "fixed", start_offset_s = 30)
## # A tibble: 3 × 12
##   start_time     end_time   slope r_square correlation b_param fit_used g_factor
##   <chr>          <chr>      <dbl>    <dbl>       <dbl>   <dbl> <chr>       <dbl>
## 1 2023/03/11 07… 2023/03… -0.101    0.489       -0.700      NA linear         NA
## 2 2023/03/11 08… 2023/03…  0.0408   0.0869       0.295      NA linear         NA
## 3 2023/03/11 09… 2023/03…  0.0856   0.548        0.740      NA linear         NA
## # ℹ 4 more variables: flag <chr>, reference_time <dttm>, site <chr>,
## #   analyzer <chr>

Per-row window overrides

When reference_df contains a duration_minutes and/or start_offset_s column, a non-NA value in that column overrides the global parameter for that specific measurement. Rows without the column, or with NA, fall back to the global value. This is useful when individual closures have different deployment lengths or when one measurement needs a longer dead-bar skip.

Model selection

The fit_type parameter controls the regression model:

  • "linear" (default): ordinary least-squares slope. Valid when concentration change is small relative to ambient.
  • "quadratic": second-order polynomial; the reported slope is the initial rate at t = 0 (the coefficient of t).
  • "exp_tz" / "exp_zhao18": Hutchinson–Mosier nonlinear exponential models; slope evaluated at t_zero.
  • "auto": tries linear → quadratic → exp_tz in sequence, accepting a more complex model only when R² improves by > 0.01 and the exponential curvature parameter b does not exceed κmax (Hüppi et al. 2018). The model chosen for each measurement is recorded in the fit_used output column.

When fit_type = "auto", analyser precision is looked up automatically from the built-in analyzer_precision dataset (LI-7810, LI-7820, and LGR UGGA) if instrument_precision (see Quality flags below) is not supplied. The lookup matches the analyzer_code argument against model names case-insensitively, so "LGR" correctly resolves to LGR UGGA precision rather than falling back to LI-7810.

# Auto model selection with fixed window and 30-s dead-bar skip
calculate_regression(ch4, ref, ghg = "CH4", reference_time = "date_time",
                     site = "site", analyzer_code = "1337",
                     fit_type = "auto",
                     window_type = "fixed", start_offset_s = 30)
## # A tibble: 3 × 12
##   start_time     end_time   slope r_square correlation b_param fit_used g_factor
##   <chr>          <chr>      <dbl>    <dbl>       <dbl>   <dbl> <chr>       <dbl>
## 1 2023/03/11 07… 2023/03…  0.155     0.700      -0.700      NA quadrat…   -1.53 
## 2 2023/03/11 08… 2023/03… -0.392     0.741       0.295      NA quadrat…   -9.60 
## 3 2023/03/11 09… 2023/03… -0.0855    0.694       0.740      NA quadrat…   -0.999
## # ℹ 4 more variables: flag <chr>, reference_time <dttm>, site <chr>,
## #   analyzer <chr>

Quality flags

Each measurement receives a flag in the output:

Flag Meaning
ok R² ≥ threshold and correlation ≥ threshold; use this slope
discard Detectable trend but poor fit (R² below threshold)
zero No detectable trend (correlation below threshold, or slope within instrument noise)
no_data Insufficient data in the window

Two optional parameters tighten the quality control:

  • instrument_precision: analyser precision in ppm. Slopes whose absolute value falls below precision / window_duration_s are flagged "zero" even if R² passes. Also used by fit_type = "auto" to compute κmax per measurement.
  • g_factor_threshold: when fit_type = "auto" selects a nonlinear model, the g-factor (chosen slope / linear baseline slope) is computed and recorded in the g_factor output column. If abs(g_factor) > g_factor_threshold, the flag is overridden to "discard" (Gaudard et al. 2025). Default NULL (disabled).

Method 2: Type directly in R

The reference information can also be constructed directly in R as a data frame.

ref_direct <- data.frame(
  date_time = as.POSIXct("2023-03-11 07:32:00", tz = "Asia/Taipei"),
  site = "S1",
  analyzer = "1337"
)
calculate_regression(ch4, ref_direct, ghg = "CH4", reference_time = "date_time",
                     site = "site", analyzer_code = "1337")
## # A tibble: 1 × 12
##   start_time     end_time   slope r_square correlation b_param fit_used g_factor
##   <chr>          <chr>      <dbl>    <dbl>       <dbl>   <dbl> <chr>       <dbl>
## 1 2023/03/11 07… 2023/03… -0.0406    0.671      -0.819      NA linear         NA
## # ℹ 4 more variables: flag <chr>, reference_time <dttm>, site <chr>,
## #   analyzer <chr>

Method 3: Using make_reference()

make_reference() builds a reference_df from a compact multiline string — one closure per line, whitespace-separated: time site analyzer [dur=N] [off=N]. This avoids maintaining a separate Excel file for small campaigns or ad-hoc measurements.

make_reference(
  date = "2023-03-11",
  measurements = "
    07:32:00  S1  1337
    07:45:00  S2  1337
  "
)
## # A tibble: 2 × 3
##   site  date_time           analyzer
##   <chr> <dttm>              <chr>   
## 1 S1    2023-03-11 07:32:00 1337    
## 2 S2    2023-03-11 07:45:00 1337

Per-row window overrides use dur= and off= tokens. Rows without an override simply omit the token.

make_reference(
  date = "2023-03-11",
  measurements = "
    07:32:00  S1  1337  dur=3.5
    07:45:00  S2  1337  dur=2.0  off=45
    08:00:00  S3  1337
  "
)
## # A tibble: 3 × 5
##   site  date_time           analyzer duration_minutes start_offset_s
##   <chr> <dttm>              <chr>               <dbl>          <dbl>
## 1 S1    2023-03-11 07:32:00 1337                  3.5             NA
## 2 S2    2023-03-11 07:45:00 1337                  2               45
## 3 S3    2023-03-11 08:00:00 1337                 NA               NA

When output is specified, the tibble is also written to .xlsx:

make_reference(
  date = "2023-03-11",
  measurements = "
    07:32:00  S1  1337
    07:45:00  S2  1337
  ",
  output = "reference.xlsx"
)

Calculate the flux — calculate_ghg_flux()

The equation we used to calculate the GHG flux was derived from Yong et al. (2024):

\[ F = \frac{(S \times V \times c)}{R \times T \times A} \]

where S is the slope obtained from the linear regression of GHG concentration changes over time (ppm s\(^{-1}\)), V is the chamber volume (liters), c is the conversion factor from seconds to hours, R is the ideal gas constant (0.082 L atm K\(^{−1}\) mol\(^{−1}\)), T is the temperature inside the chamber (kelvin), and A is the surface area of the chamber (m\(^2\)).

Therefore, the function calculate_ghg_flux() requires a dataframe with columns for the slope, area, volume, and temperature. Note that the units of the parameters must match those described above, resulting in a flux with units of mmol m\(^{-2}\) d\(^{-1}\).

results_ch4 <- calculate_regression(ch4, ref_direct, ghg = "CH4", reference_time = "date_time",
                                    site = "site", analyzer_code = "1337")
# In practice, filter to flag == "ok" before calculating flux
flux_ch4 <- data.frame(
    slope  = results_ch4$slope,
    area   = 1, # in square meter
    volume = 1, # in litre
    temp   = 1) # in celsius
calculate_ghg_flux(flux_ch4)
##         slope area volume temp         flux         unit
## 1 -0.04058922    1      1    1 -0.006495451 mmol m-2 h-1

Additional GHG tools

convert_ghg_unit()

convert_ghg_unit() converts a GHG flux value between any supported flux units. The default output is µg m⁻² h⁻¹, but the target unit is fully configurable via to_mass, to_area, and to_time. The function is vectorized over numeric columns and works directly inside dplyr::mutate().

  • Numeric input (scalar or vector): returns a numeric value/vector.
  • Character string input (e.g., mean ± SD notation): converts each embedded number in-place and returns the modified string. Works in dplyr::mutate() for character columns.
  • NA input: returns NA without error.
# Numeric input: 97 mg CH4 m-2 h-1 → µg m-2 h-1 (default target)
convert_ghg_unit(97, ghg = "ch4", mass = "mg", area = "m2", time = "hr")
## [1] 97000
# Numeric vector: works directly in dplyr::mutate()
convert_ghg_unit(c(1.2, 0.8, NA), ghg = "ch4", mass = "µmol", area = "m2", time = "hr")
## [1] 19.25 12.83    NA
# Custom target unit: µmol m-2 h-1 → mmol m-2 d-1
convert_ghg_unit(97, ghg = "ch4", mass = "µmol", area = "m2", time = "hr",
                 to_mass = "mmol", to_area = "m2", to_time = "day")
## [1] 2.33
# Character string input: each number is converted individually, string is preserved
convert_ghg_unit("0.002 ± 0.003", ghg = "ch4", mass = "mmol", area = "m2", time = "hr")
## [1] "32.08 ± 48.12"

calculate_MDF()

calculate_MDF() calculates the Minimum Detectable Flux (MDF) for a static chamber system. This is useful for evaluating whether a gas analyser is precise enough to detect fluxes at the expected magnitude of the study site.

calculate_MDF(
  precision_ppm     = 1,
  closure_time_s    = 300,
  data_point_n      = 300,
  chamber_volume_m3 = 0.0064,
  temperature_C     = 25,
  chamber_area_m2   = 0.07
)
## $MDF
## [1] 113.953
## 
## $unit
## [1] "µg m⁻² h⁻¹"

calculate_total_co2e()

calculate_total_co2e() aggregates individual GHG fluxes (mg m⁻² h⁻¹) into a single total CO₂-equivalent flux (g m⁻² d⁻¹) using IPCC AR6 100-year Global Warming Potentials (GWPs; CH₄ = 27, N₂O = 273).

calculate_total_co2e(co2 = 4.02, ch4 = 0.001, n2o = 0.003)
## [1] 0.116784

NEM calculation

The change in oxygen concentration in the water reflects the balance between photosynthetic production, respiratory consumption, and physical exchange at the water-atmosphere interface. By measuring dissolved oxygen (DO) concentrations at high frequency with in situ data loggers, we can calculate gross primary production (GPP), ecosystem respiration (ER), and net ecosystem metabolism (NEM), as detailed by Staehr et al. (2010).

Load and process the HOBO logger data — process_hobo()

First, we need to obtain the necessary parameters from the raw data downloaded from the data loggers. process_hobo() removes unnecessary rows and columns, extracts the dissolved oxygen concentration and water temperature along with their date and time, drops rows with NA values, and labels the output with the supplied no_hobo identifier for use in later steps.

hobo_data_path <- system.file("extdata", "ex_hobo.csv", package = "aelab")
do <- process_hobo(hobo_data_path, no_hobo = "code_for_logger", type = "do")
do[c(1:3), ]
##             date_time   do  temp         no_hobo
## 1 2024-04-08 12:00:00 7.69 25.86 code_for_logger
## 2 2024-04-08 12:30:00 7.67 26.02 code_for_logger
## 3 2024-04-08 13:00:00 7.65 26.20 code_for_logger

process_hobo() accepts both Chinese (上午/下午) and English (AM/PM or 24-hour) HOBOware locale exports. For type = "do" (the default, used above for a HOBO U26 Dissolved Oxygen Data Logger), it also drops readings with a negative (sensor-error) DO or temperature value. Set type = "temp" to process a HOBO Pro v2 Temperature/RH Logger instead, which returns air_temp and rh columns in place of do and temp, with one row per reading.

The air pressure and wind speed at the sampling site should be obtained either manually or from a nearby weather station. process_weather() tidies the CSV file downloaded from Taiwan’s weather station. In addition to extracting the columns with air pressure and wind speed values, it also duplicates the hourly data to match the half-hourly recorded frequency of the DO concentrations.

weather_data_path <- system.file("extdata", "ex_weather.csv", package = "aelab")
weather <- process_weather(weather_data_path, date = "2024-04-10", zone = "zone_A")
weather[c(1:5), ]
##   pressure_hpa wind_ms           date_time   zone
## 1       1015.2     1.4 2024-04-10 00:30:00 zone_A
## 2       1015.2     1.4 2024-04-10 01:00:00 zone_A
## 3       1015.0     1.6 2024-04-10 01:30:00 zone_A
## 4       1015.0     1.6 2024-04-10 02:00:00 zone_A
## 5       1014.9     1.4 2024-04-10 02:30:00 zone_A

In addition to the previously mentioned parameters, we also need the water depth, salinity, start and end date and time of the data logger deployment, and the sunrise and sunset times during that period. These parameters can be entered in Excel and then imported into R using process_info() to ensure that all necessary parameters are included. The column zone corresponds to process_weather(), and column no_hobo corresponds to process_hobo().

info_data_path <- system.file("extdata", "info.xlsx", package = "aelab")
info <- process_info(info_data_path)
info
## # A tibble: 1 × 9
##   zone   site   no_hobo depth_m salinity start_date_time     end_date_time      
##   <chr>  <chr>  <chr>     <dbl>    <dbl> <dttm>              <dttm>             
## 1 zone_A site_A code_f…       1       10 2024-04-10 00:00:00 2024-04-10 23:59:00
## # ℹ 2 more variables: sunrise <dttm>, sunset <dttm>

We can then merge the air pressure and wind speed data from process_weather() with the processed logger data, combine the result with the remaining parameters using the zone and no_hobo columns, and use plot_hobo() to inspect the resulting DO concentrations.

data <- merge(weather, do, by = "date_time")
merged_df <- data %>% 
  inner_join(info, by = c("zone", "no_hobo"))
merged_df[c(1:3), ]
##             date_time pressure_hpa wind_ms   zone   do  temp         no_hobo
## 1 2024-04-10 00:30:00       1015.2     1.4 zone_A 8.28 25.10 code_for_logger
## 2 2024-04-10 01:00:00       1015.2     1.4 zone_A 8.07 25.04 code_for_logger
## 3 2024-04-10 01:30:00       1015.0     1.6 zone_A 7.80 24.98 code_for_logger
##     site depth_m salinity start_date_time       end_date_time
## 1 site_A       1       10      2024-04-10 2024-04-10 23:59:00
## 2 site_A       1       10      2024-04-10 2024-04-10 23:59:00
## 3 site_A       1       10      2024-04-10 2024-04-10 23:59:00
##               sunrise              sunset
## 1 1899-12-31 05:45:00 1899-12-31 18:15:00
## 2 1899-12-31 05:45:00 1899-12-31 18:15:00
## 3 1899-12-31 05:45:00 1899-12-31 18:15:00
plot_hobo(merged_df)

Batch processing with automated weather and sun times

For multi-site or multi-day deployments, the steps above can be replaced with a batch pipeline built from five functions: combine_hobo() reads every logger file in a folder, filter_complete_days() keeps only days with (near-)complete diel coverage, add_sun_times() computes sunrise and sunset from site coordinates instead of looking them up, and process_weather_mh() / combine_weather_mh() import weather in bulk from Taiwan’s Central Weather Administration (CWA) CODiS portal instead of one file per day.

Combine logger files in a folder — combine_hobo()

combine_hobo() applies process_hobo() to every file in a folder matching file_prefix and row-binds the result.

hobo_data_dir <- system.file("extdata", package = "aelab")
hobo_all <- combine_hobo(hobo_data_dir, file_prefix = "ex_ho")
hobo_all[c(1:3), ]
##             date_time   do  temp no_hobo
## 1 2024-04-08 12:00:00 7.69 25.86      bo
## 2 2024-04-08 12:30:00 7.67 26.02      bo
## 3 2024-04-08 13:00:00 7.65 26.20      bo

Keep only complete days — filter_complete_days()

filter_complete_days() drops calendar days where a logger has fewer than min_slots distinct 30-minute readings (default 44 of 48), so partial days at the start or end of a deployment don’t bias the daily metabolism estimate. Set summary_only = TRUE to inspect per-day coverage before filtering.

filter_complete_days(hobo_all, summary_only = TRUE)
##   no_hobo       date n_slots complete
## 1      bo 2024-04-08      24    FALSE
## 2      bo       <NA>       1    FALSE
## 3      bo 2024-04-09      47     TRUE
## 4      bo 2024-04-10      47     TRUE
## 5      bo 2024-04-11      47     TRUE
## 6      bo 2024-04-12      47     TRUE
## 7      bo 2024-04-13      47     TRUE
## 8      bo 2024-04-14      47     TRUE
## 9      bo 2024-04-15      19    FALSE
hobo_complete <- filter_complete_days(hobo_all)
nrow(hobo_complete)
## [1] 282

Sunrise and sunset from site coordinates — add_sun_times()

add_sun_times() attaches per-date sunrise and sunset columns computed from the site’s latitude and longitude using the suncalc package, removing the need to look these up manually for process_info().

hobo_complete <- add_sun_times(hobo_complete, lat = 22.3900, lon = 120.5777)
hobo_complete[c(1:3), ]
##             date_time   do  temp no_hobo       date             sunrise
## 1 2024-04-09 00:30:00 7.82 25.30      bo 2024-04-09 2024-04-09 05:44:08
## 2 2024-04-09 01:00:00 7.76 25.50      bo 2024-04-09 2024-04-09 05:44:08
## 3 2024-04-09 01:30:00 7.74 25.64      bo 2024-04-09 2024-04-09 05:44:08
##                sunset
## 1 2024-04-09 18:16:35
## 2 2024-04-09 18:16:35
## 3 2024-04-09 18:16:35

Bulk weather import

The CWA CODiS portal caps hourly downloads at one month, so a deployment typically spans several LotsDataReports.txt files in the “MH” (Multifield Hourly) format. process_weather_mh() parses one such file; combine_weather_mh() reads every .txt file in a directory (recursively), row-binds, and de-duplicates by station and time. Both expand hourly values to the 30-minute marks expected by calculate_do().

weather_mh_path <- system.file("extdata", "ex_weather_mh.txt", package = "aelab")
process_weather_mh(weather_mh_path, station_id = "C0R660")
##     stno           date_time pressure_hpa wind_ms air_temp humidity wind_dir
## 1 C0R660 2024-03-01 00:30:00       1012.0     0.3     25.4       97       94
## 2 C0R660 2024-03-01 01:00:00       1012.0     0.3     25.4       97       94
## 3 C0R660 2024-03-01 01:30:00       1011.4     0.3     25.0       98      109
## 4 C0R660 2024-03-01 02:00:00       1011.4     0.3     25.0       98      109
## 5 C0R660 2024-03-01 23:30:00       1015.7      NA     22.8       95        0
## 6 C0R660 2024-03-02 00:00:00       1015.7      NA     22.8       95        0
##   rain_mm
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
weather_dir <- system.file("extdata", package = "aelab")
combine_weather_mh(weather_dir, station_id = c("C0R660", "C0R850"))
##      stno           date_time pressure_hpa wind_ms air_temp humidity wind_dir
## 1  C0R660 2024-03-01 00:30:00       1012.0     0.3     25.4       97       94
## 2  C0R660 2024-03-01 01:00:00       1012.0     0.3     25.4       97       94
## 3  C0R660 2024-03-01 01:30:00       1011.4     0.3     25.0       98      109
## 4  C0R660 2024-03-01 02:00:00       1011.4     0.3     25.0       98      109
## 5  C0R660 2024-03-01 23:30:00       1015.7      NA     22.8       95        0
## 6  C0R660 2024-03-02 00:00:00       1015.7      NA     22.8       95        0
## 7  C0R850 2024-03-01 00:30:00       1009.1     1.2     24.7       90      124
## 8  C0R850 2024-03-01 01:00:00       1009.1     1.2     24.7       90      124
## 9  C0R850 2024-03-01 23:30:00       1010.2     0.9     23.1       92       34
## 10 C0R850 2024-03-02 00:00:00       1010.2     0.9     23.1       92       34
##    rain_mm
## 1      0.0
## 2      0.0
## 3      0.0
## 4      0.0
## 5      0.0
## 6      0.0
## 7      0.5
## 8      0.5
## 9      0.0
## 10     0.0

Putting it together

For each site, merge its hobo data with the matching station’s weather by date_time, attach site metadata, and run calculate_do():

weather <- combine_weather_mh("path/to/weather_mh", station_id = "C0R660")

hobo <- combine_hobo("path/to/logger_csvs/site_A")
hobo <- filter_complete_days(hobo)
hobo <- add_sun_times(hobo, lat = 22.3900, lon = 120.5777)
hobo$site <- "site_A"
hobo$depth_m <- 2.0
hobo$salinity <- 30

merged <- merge(hobo, weather[c("date_time", "pressure_hpa", "wind_ms")], by = "date_time")
calculate_do(merged)

Calculate GPP and ER — calculate_do()

calculate_do() returns one row per logger-day with the columns r_day (ecosystem respiration, ER), gpp (gross primary production, GPP), and nep (net ecosystem metabolism, NEM). NEM is calculated as GPP + ER.

calculate_do(merged_df)
##     site         no_hobo       date     r_day      gpp        nep
## 1 site_A code_for_logger 2024-04-10 -17.77689 17.56762 -0.2092709

Statistical analysis

aelab provides a small suite of statistical helpers that cover a typical aquatic ecology workflow: describe your data, check for outliers, test normality, transform if needed, and choose between a parametric or non-parametric group comparison.

The examples below use a simple self-contained data frame with three groups.

set.seed(42)
stat_df <- data.frame(
  group = rep(c("A", "B", "C"), each = 8),
  value = c(rnorm(8, 2, 0.4), rnorm(8, 3.5, 0.5), rnorm(8, 5, 0.6))
)

Descriptive statistics — descriptive_statistic()

descriptive_statistic() returns mean ± SD and min–max for each group.

descriptive_statistic(stat_df, vars = value, groups = group)
## # A tibble: 3 × 3
##   group value_mean_sd value_min_max
##   <chr> <chr>         <chr>        
## 1 A     2.18 ± 0.29   1.77 to 2.6  
## 2 B     3.77 ± 0.63   2.81 to 4.64 
## 3 C     4.62 ± 0.92   3.41 to 5.79

Outlier detection — find_outlier()

find_outlier() uses the IQR method (values outside 1.5 × IQR from Q1/Q3) to flag outliers.

find_outlier(stat_df, var = "value")
## # A tibble: 0 × 2
## # ℹ 2 variables: row_index <int>, outlier_value <dbl>

Normality testing

Use normality_test_t() when comparing two groups (t-test context), and normality_test_aov() when comparing three or more groups (ANOVA context). Both apply Shapiro-Wilk to raw, square-root, and log10 transforms so you can identify whether a transformation improves normality.

# Two-group context
normality_test_t(stat_df, "value", group, "A", "B")
## # A tibble: 6 × 4
##   variable group p_value transformation
##   <chr>    <chr>   <dbl> <chr>         
## 1 value    A       0.613 None          
## 2 value    B       0.733 None          
## 3 value    A       0.581 Square root   
## 4 value    B       0.691 Square root   
## 5 value    A       0.547 Logarithm     
## 6 value    B       0.619 Logarithm
# One-way ANOVA context (all three groups)
normality_test_aov(stat_df, "value", "group")
## # A tibble: 3 × 3
##   variable p_value transformation
##   <chr>      <dbl> <chr>         
## 1 value     0.848  None          
## 2 value     0.521  Square root   
## 3 value     0.0745 Logarithm

Data transformation — df_trans()

If residuals are not normal, df_trans() appends a transformed column to the data frame for re-testing.

stat_df_t <- df_trans(stat_df, "value", "sqrt")
head(stat_df_t)
##   group    value value_sqrt
## 1     A 2.548383   2.060021
## 2     A 1.774121   2.240078
## 3     A 2.145251   2.155648
## 4     A 2.253145   2.130475
## 5     A 2.161707   2.151827
## 6     A 1.957550   2.198754

Parametric test — aov_test()

When normality holds, aov_test() runs a one-way ANOVA followed by Tukey HSD post-hoc comparisons and returns compact letter display (CLD) groupings.

aov_test(stat_df, "value", "group")
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## group        2  24.60   12.30   27.98 1.19e-06 ***
## Residuals   21   9.23    0.44                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## $group
##   C   B   A 
## "a" "b" "c"
## $anova_summary
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## group        2  24.60   12.30   27.98 1.19e-06 ***
## Residuals   21   9.23    0.44                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $tukey_results
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: stats::aov(formula = formula, data = df)
## 
## $group
##          diff        lwr      upr     p adj
## B-A 1.5980249 0.76252027 2.433529 0.0002598
## C-A 2.4411682 1.60566363 3.276673 0.0000009
## C-B 0.8431434 0.00763877 1.678648 0.0476844
## 
## 
## $compact_letters
## $compact_letters$group
##   C   B   A 
## "a" "b" "c"

Non-parametric test — ks_test()

When normality cannot be assumed, ks_test() runs Kruskal-Wallis followed by Dunn’s post-hoc test with Bonferroni correction.

ks_test(stat_df, "value", "group")
## $ks_results
## 
##  Kruskal-Wallis rank sum test
## 
## data:  value by group
## Kruskal-Wallis chi-squared = 17.165, df = 2, p-value = 0.0001874
## 
## 
## $dunn_results
##   Comparison         Z      P.unadj        P.adj
## 1      A - B -2.722361 6.481727e-03 0.0194451820
## 2      A - C -4.065864 4.785484e-05 0.0001435645
## 3      B - C -1.343503 1.791092e-01 0.5373275782
## 
## $mean_summary
## # A tibble: 3 × 2
##   group  mean
##   <chr> <dbl>
## 1 A      2.18
## 2 B      3.77
## 3 C      4.62
## 
## $compact_letters
##   Group Letter MonoLetter
## 1     A      a         a 
## 2     B      b          b
## 3     C      b          b

Faceted CLD annotation — sig_labels()

sig_labels() wraps ks_test() to produce compact letter display (CLD) annotations for faceted ggplot2 boxplots. It iterates over each level of a faceting variable (by, default "year"), runs ks_test() on each subset, and returns a tidy data frame ready to pass directly to geom_text(). Facets where the Kruskal-Wallis p-value is ≥ alpha are silently dropped.

set.seed(1)
sig_df <- data.frame(
  year = rep(c("2023", "2024"), each = 20),
  site = rep(c("A", "B", "C", "D"), 10),
  flux = c(
    rnorm(20, mean = rep(c(1, 3, 2, 6), 5)),
    rnorm(20)
  )
)
labels <- sig_labels(sig_df, "flux", "site", by = "year")
labels
##   site Letter MonoLetter    y_pos year
## 1    A      a         a  1.575781 2023
## 2    B     ab         ab 3.943836 2023
## 3    C     ab         ab 3.511781 2023
## 4    D      b          b 7.595281 2023

Only 2023 appears in labels because 2024 has a non-significant Kruskal-Wallis result. Pass the output to geom_text() via the data argument to annotate a faceted boxplot:

ggplot(sig_df, aes(x = site, y = flux)) +
  geom_boxplot() +
  geom_text(
    data   = labels,
    aes(x = site, y = y_pos, label = Letter),
    vjust  = -0.5
  ) +
  facet_wrap(~year)

When the raw observations used for testing differ from the plotted data (e.g. the plot shows aggregated values), pass the plotted data as plot_data so y-positions are computed on the correct scale.

Visualization

aelab provides plot wrappers — plot_point(), plot_line(), plot_box(), and plot_bar() — with a consistent theme built on Century Gothic. Custom colour palettes are available through aelab_palettes() and the discrete and continuous scale functions scale_colour_aelab_d(), scale_fill_aelab_d(), scale_colour_aelab_c(), and scale_fill_aelab_c().

The example below previews the "ghg" palette and applies it to a box plot comparing CH₄ fluxes across three habitat types.

# Preview the "ghg" palette
aelab_palettes("ghg")

# Box plot with aelab palette
df_vis <- data.frame(
  site = rep(c("Mangrove", "Mudflat", "Seagrass"), each = 10),
  ch4  = c(rnorm(10, 2, 0.5), rnorm(10, 1, 0.3), rnorm(10, 0.5, 0.2))
)

plot_box(df_vis, site, ch4, site) +
  scale_fill_aelab_d("ghg")

References

Gaudard, J., Telford, R. J., & Halbritter, A. H. (2025). fluxible: An R package to process ecosystem gas fluxes from closed-loop chambers in an automated and reproducible way. Methods in Ecology and Evolution, 00, 1–9. https://doi.org/10.1111/2041-210X.70161

Hüppi, R., Felber, R., Krauss, M., Six, J., Leifeld, J., & Fuß, R. (2018). Restricting the nonlinearity parameter in soil greenhouse gas flux calculations for more reliable flux estimates. PLOS ONE, 13(7), e0200876. https://doi.org/10.1371/journal.pone.0200876

Staehr, P. A., Bade, D., Van de Bogert, M. C., Koch, G. R., Williamson, C., Hanson, P., Cole, J. J., & Kratz, T. (2010). Lake metabolism and the diel oxygen technique: State of the science. Limnology and Oceanography: Methods, 8(11), 628–644. https://doi.org/10.4319/lom.2010.8.0628

Yong, Z.-J., Lin, W.-J., Lin, C.-W., & Lin, H.-J. (2024). Tidal influence on carbon dioxide and methane fluxes from tree stems and soils in mangrove forests. Biogeosciences, 21(22), 5247–5260. https://doi.org/10.5194/bg-21-5247-2024