Tutorial #2 - Preprocessing options

Load the PyEEM library and display version

[1]:
import pyeem
print(pyeem.__version__)
0.1.1

Check out the supported instruments

[2]:
pyeem.instruments.supported
[2]:
                                 name
manufacturer  supported_models
Agilent       Cary 4E            cary_4e
              Cary Eclipse       cary_eclipse
Horiba        Aqualog-880-C      aqualog
              SPEX Fluorolog-3   fluorolog
Tecan         Spark              spark

Check out the demo datasets

[3]:
demos_df = pyeem.datasets.demos
display(demos_df)

print("Dataset description for the drEEM demo:")
print(demos_df[
    demos_df["demo_name"] == "drEEM"
]["description"].item())
  | demo_name | description | citation | DOI | absorbance_instrument | water_raman_instrument | EEM_instrument
0 | rutherford | Excitation Emission Matrix (EEM) fluorescence ... | Rutherford, Jay W., et al. "Excitation emissio... | 10.1016/j.atmosenv.2019.117065 | Aqualog | None | Aqualog
1 | drEEM | The demo dataset contains measurements made du... | Murphy, Kathleen R., et al. "Fluorescence spec... | 10.1039/c3ay41160e | Cary 4E | Fluorolog | Fluorolog
Dataset description for the drEEM demo:
The demo dataset contains measurements made during four surveys of San Francisco Bay that took place in spring, summer, autumn and winter 2006 (Murphy et al. 2013, J. Mar. Syst. 111-112, 157-166).

Download the drEEM demo dataset from S3

  • Please note that this step requires an internet connection because the data is downloaded from an AWS S3 bucket.

[4]:
demo_data_dir = pyeem.datasets.download_demo(
    "demo_data",
    demo_name="drEEM"
)
Download Demo Dataset from S3: 100%|██████████| 260/260 [00:00<00:00, 163031.70it/s]

Load the dataset

[5]:
demo_data_dir = "demo_data/drEEM"

dataset = pyeem.datasets.Dataset(
    data_dir=demo_data_dir,
    raman_instrument="fluorolog",
    absorbance_instrument="cary_4e",
    eem_instrument="fluorolog",
    mode="w"
)
WARNING: No Sample EEM scans were found in sample set 1.
WARNING: No corresponding absorbance scan for sample EEM sample_eem2.csv in sample set 8. There should be an absorbance measurement named absorb2.csv in this sample set.
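
The second warning suggests a naming convention: a sample EEM file named sample_eem2.csv is expected to have an absorbance scan named absorb2.csv in the same sample set. The sketch below is not part of PyEEM. It is a standalone illustration, using only the standard library, of how such a check could be run on a list of filenames before loading. The helper names and the example file list are hypothetical.

```python
import re


def expected_absorbance_name(sample_eem_filename):
    """Map 'sample_eem2.csv' to 'absorb2.csv' (pattern taken from the warning text)."""
    match = re.fullmatch(r"sample_eem(\d+)\.csv", sample_eem_filename)
    if match is None:
        raise ValueError(f"not a sample EEM filename: {sample_eem_filename!r}")
    return f"absorb{match.group(1)}.csv"


def missing_absorbance_scans(filenames):
    """Return the absorbance filenames that sample EEMs expect but that are absent."""
    present = set(filenames)
    missing = []
    for name in sorted(present):
        if name.startswith("sample_eem"):
            partner = expected_absorbance_name(name)
            if partner not in present:
                missing.append(partner)
    return missing


# Hypothetical contents of one sample set directory
sample_set_files = [
    "water_raman1.csv",
    "blank_eem1.csv",
    "sample_eem1.csv",
    "absorb1.csv",
    "sample_eem2.csv",
]
print(missing_absorbance_scans(sample_set_files))
```

Running a check like this on your own data before creating the Dataset can show which scans would trigger the warning.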

Let’s check out the metadata

  • The metadata describes each collected sample set. A sample set groups several scan types: water_raman, blank_eem, sample_eem, and absorb.

[6]:
display(dataset.meta_df)
datetime_utc filename collected_by description comments water_raman_wavelength dilution_factor cruise Site Rep SampID filepath name hdf_path
sample_set scan_type
1 water_raman 2006-04-27 00:00:00 water_raman1.csv Anon 275.0 1.0 SF-p-win 1B 3.0 478.0 /home/roboat/Documents/roboat/PyEEM/docs/sourc... water_raman1 raw_sample_sets/1/water_raman1
blank_eem 2006-04-27 11:59:30 blank_eem1.csv Anon NaN 1.0 SF-p-win 1B 3.0 478.0 /home/roboat/Documents/roboat/PyEEM/docs/sourc... blank_eem1 raw_sample_sets/1/blank_eem1
absorb 2006-04-27 23:59:00 absorb1.csv Anon NaN 1.0 SF-p-win 1B 3.0 478.0 /home/roboat/Documents/roboat/PyEEM/docs/sourc... absorb1 raw_sample_sets/1/absorb1
2 water_raman 2006-04-28 00:00:00 water_raman1.csv Anon 275.0 1.0 SF-p-win 1B 2.0 477.0 /home/roboat/Documents/roboat/PyEEM/docs/sourc... water_raman1 raw_sample_sets/2/water_raman1
blank_eem 2006-04-28 05:59:45 blank_eem1.csv Anon NaN 1.0 SF-p-win 1B 2.0 477.0 /home/roboat/Documents/roboat/PyEEM/docs/sourc... blank_eem1 raw_sample_sets/2/blank_eem1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
15 absorb 2006-09-14 23:59:00 absorb11.csv Anon NaN 1.0 SF-p-win 3A 3.0 493.0 /home/roboat/Documents/roboat/PyEEM/docs/sourc... absorb11 raw_sample_sets/15/absorb11
16 water_raman 2006-09-15 00:00:00 water_raman1.csv Anon 275.0 1.0 SF-p-win 3A 1.0 491.0 /home/roboat/Documents/roboat/PyEEM/docs/sourc... water_raman1 raw_sample_sets/16/water_raman1
blank_eem 2006-09-15 07:59:40 blank_eem1.csv Anon NaN 1.0 SF-p-win 3A 1.0 491.0 /home/roboat/Documents/roboat/PyEEM/docs/sourc... blank_eem1 raw_sample_sets/16/blank_eem1
sample_eem 2006-09-15 15:59:20 sample_eem1.csv Anon NaN 1.0 SF-p-win 3A 1.0 491.0 /home/roboat/Documents/roboat/PyEEM/docs/sourc... sample_eem1 raw_sample_sets/16/sample_eem1
absorb 2006-09-15 23:59:00 absorb1.csv Anon NaN 1.0 SF-p-win 3A 1.0 491.0 /home/roboat/Documents/roboat/PyEEM/docs/sourc... absorb1 raw_sample_sets/16/absorb1

182 rows × 14 columns

Check out the metadata summary information

[7]:
dataset.metadata_summary_info()
[7]:
  | Start datetime (UTC) | End datetime (UTC) | Number of sample sets | Number of blank EEMs | Number of sample EEMs | Number of water raman scans | Number of absorbance scans
0 | 2006-04-27 | 2006-09-15 23:59:00 | 16 | 16 | 73 | 16 | 77
[8]:
from IPython.display import HTML
fig_kws = {"dpi": 200}
anim = pyeem.plots.water_raman_peak_animation(dataset, excitation_wavelength=275, fig_kws=fig_kws)
HTML(anim.to_html5_video())
[8]:
../../_images/tutorials_notebooks_tutorial_2_15_1.png
[9]:
import matplotlib.pyplot as plt
fig_kws={"dpi": 95}
plot_kws = {"fmt": "o-"}
kwargs = {"byweekday": 0}
ax = pyeem.plots.water_raman_timeseries(
    dataset,
    excitation_wavelength=275,
    fig_kws=fig_kws,
    plot_kws=plot_kws,
    **kwargs
)
plt.show()
../../_images/tutorials_notebooks_tutorial_2_16_0.png

Create a preprocessing routine

  • The demo dataset contains raw scans. Before this data can be analyzed and interpreted, several preprocessing steps must be applied.
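
To make the arithmetic behind three of these steps concrete, here is a minimal sketch on a made-up 2 x 2 EEM, applied in the same order as the routine below: blank subtraction, inner filter effect correction, then Raman normalization. It is not PyEEM's implementation, which may differ in detail. The inner filter step uses the commonly cited absorbance-based correction, F_corr = F_obs * 10^((A_ex + A_em) / 2), and the Raman step simply divides by a given water Raman peak area. All function names and numbers are illustrative assumptions.

```python
def blank_subtract(sample, blank):
    """Subtract the blank EEM from the sample EEM, element by element."""
    return [[s - b for s, b in zip(s_row, b_row)]
            for s_row, b_row in zip(sample, blank)]


def inner_filter_correct(eem, abs_ex, abs_em):
    """Scale each cell by 10 ** (0.5 * (A_ex + A_em)).

    Rows of `eem` follow excitation wavelengths and columns follow emission
    wavelengths (a layout chosen for this sketch only).
    """
    return [[value * 10 ** (0.5 * (a_ex + a_em))
             for value, a_em in zip(row, abs_em)]
            for row, a_ex in zip(eem, abs_ex)]


def raman_normalize(eem, raman_peak_area):
    """Divide every intensity by the water Raman peak area (gives Raman units)."""
    return [[value / raman_peak_area for value in row] for row in eem]


# Made-up numbers, for illustration only
sample = [[10.0, 20.0], [30.0, 40.0]]
blank = [[1.0, 2.0], [3.0, 4.0]]
abs_ex = [0.0, 0.2]      # absorbance at each excitation wavelength
abs_em = [0.0, 0.2]      # absorbance at each emission wavelength
raman_peak_area = 9.0    # integrated area of the water Raman peak

step1 = blank_subtract(sample, blank)
step2 = inner_filter_correct(step1, abs_ex, abs_em)
step3 = raman_normalize(step2, raman_peak_area)
print(step3)
```

The change of units from "Intensity, AU" to "Intensity, RU" in the routine results corresponds to the final division.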

[10]:
routine_df = pyeem.preprocessing.create_routine(
    crop=False,
    discrete_wavelengths=False,
    gaussian_smoothing=False,
    blank_subtraction=True,
    inner_filter_effect=True,
    raman_normalization=True,
    scatter_removal=True,
    dilution=False,
)

display(routine_df)
step_name hdf_path
step_order
0 raw raw_sample_sets/
1 blank_subtraction preprocessing/corrections/blank_subtraction
2 inner_filter_effect preprocessing/corrections/inner_filter_effect
3 raman_normalization preprocessing/corrections/raman_normalization
4 scatter_removal preprocessing/corrections/scatter_removal
5 complete preprocessing/complete/

Execute the preprocessing routine

  • Each preprocessing step accepts parameters that control how it runs. They are passed to perform_routine as keyword arguments, such as excision_width and fill below. See the documentation for the full list of options.

  • Please note that depending on the steps and settings you’ve chosen as well as your dataset’s size, the time it takes for this step to complete will vary.
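
As an illustration of what options like excision_width and fill could mean for scatter removal, the sketch below cuts a band around the first-order Rayleigh line (emission equal to excitation) out of a single emission spectrum, then either leaves the gap as NaN or fills it by linear interpolation. This is a sketch under stated assumptions, not PyEEM's code: the function name is hypothetical, and treating excision_width as the full width of the band centered on the scatter line is an assumption.

```python
import math


def excise_rayleigh(emission_nm, intensity, excitation_nm, excision_width, fill="interp"):
    """Remove first-order Rayleigh scatter from one emission spectrum.

    Points with |emission - excitation| <= excision_width / 2 are excised.
    With fill="interp" the gap is filled by linear interpolation between the
    nearest kept points; with any other value the gap is left as NaN.
    """
    half = excision_width / 2
    masked = [abs(em - excitation_nm) <= half for em in emission_nm]
    out = [float("nan") if m else v for v, m in zip(intensity, masked)]
    if fill != "interp":
        return out
    kept = [i for i, m in enumerate(masked) if not m]
    for i, m in enumerate(masked):
        if not m:
            continue
        left = max((k for k in kept if k < i), default=None)
        right = min((k for k in kept if k > i), default=None)
        if left is None or right is None:
            continue  # no kept neighbor on one side, so leave NaN
        t = (emission_nm[i] - emission_nm[left]) / (emission_nm[right] - emission_nm[left])
        out[i] = intensity[left] + t * (intensity[right] - intensity[left])
    return out


# Made-up spectrum with a scatter peak near the 275 nm excitation wavelength
emission = [250, 260, 270, 280, 290, 300]
spectrum = [1.0, 2.0, 50.0, 60.0, 5.0, 6.0]

filled = excise_rayleigh(emission, spectrum, excitation_nm=275, excision_width=30)
cut = excise_rayleigh(emission, spectrum, excitation_nm=275, excision_width=30, fill=None)
print(filled)
print(cut)
```

A wider excision band removes more scatter but also discards more real signal, which is why the later cells in this tutorial try several widths and fill settings.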

[11]:
kwargs = {
    "raman_source_type": "water_raman",
    "water_raman_wavelength": 275,
    "excision_width": 30,
    "fill": "interp",
}


routine_results_df = pyeem.preprocessing.perform_routine(
    dataset,
    routine_df,
    progress_bar=True,
    **kwargs
)

display(routine_results_df)
Preprocessing scan sets: 100%|██████████| 16/16 [00:38<00:00,  2.43s/it]
step_completed step_exception hdf_path units
sample_set scan_type name step_name
1 blank_eem blank_eem1 raw True None raw_sample_sets/1/blank_eem1 Intensity, AU
2 blank_eem blank_eem1 raw True None raw_sample_sets/2/blank_eem1 Intensity, AU
sample_eem sample_eem1 raw True None raw_sample_sets/2/sample_eem1 Intensity, AU
blank_subtraction True None preprocessing/corrections/blank_subtraction/2/... Intensity, AU
inner_filter_effect True None preprocessing/corrections/inner_filter_effect/... Intensity, AU
... ... ... ... ... ... ... ...
16 sample_eem sample_eem1 blank_subtraction True None preprocessing/corrections/blank_subtraction/16... Intensity, AU
inner_filter_effect True None preprocessing/corrections/inner_filter_effect/... Intensity, AU
raman_normalization True None preprocessing/corrections/raman_normalization/... Intensity, RU
scatter_removal True None preprocessing/corrections/scatter_removal/16/s... Intensity, RU
complete True None preprocessing/complete/16/sample_eem1 Intensity, RU

454 rows × 4 columns

Check to see if any of the steps failed to complete

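  • With the drEEM demo dataset, one step fails: the inner filter effect correction for sample_eem2 in sample set 8, which has no matching absorbance scan (see the warning printed when the dataset was loaded). If every step completes, this dataframe is empty.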

[12]:
display(routine_results_df[
    routine_results_df["step_exception"].notna()
])
step_completed step_exception hdf_path units
sample_set scan_type name step_name
8 sample_eem sample_eem2 inner_filter_effect False 'No object named raw_sample_sets/8/absorb2 in ... None None

Visualize the preprocessing steps for a single sample

[13]:
import matplotlib.pyplot as plt

sample_set = 16
sample_name = "sample_eem1"
axes = pyeem.plots.preprocessing_routine_plot(
    dataset,
    routine_results_df,
    sample_set=sample_set,
    sample_name=sample_name,
    plot_type="contour",
    fig_kws={"dpi": 200},
)
plt.show()
../../_images/tutorials_notebooks_tutorial_2_24_0.png
[14]:
routine_df = pyeem.preprocessing.create_routine(
    crop=False,
    discrete_wavelengths=False,
    gaussian_smoothing=False,
    blank_subtraction=True,
    inner_filter_effect=False,
    raman_normalization=False,
    scatter_removal=True,
    dilution=False,
)

display(routine_df)
step_name hdf_path
step_order
0 raw raw_sample_sets/
1 blank_subtraction preprocessing/corrections/blank_subtraction
2 scatter_removal preprocessing/corrections/scatter_removal
3 complete preprocessing/complete/
[15]:
routine_results_df = pyeem.preprocessing.perform_routine(
    dataset,
    routine_df,
    fill = None,
    excision_width = 25,
    progress_bar=True
)

axes = pyeem.plots.preprocessing_routine_plot(
    dataset,
    routine_results_df,
    sample_set=sample_set,
    sample_name=sample_name,
    plot_type="contour",
    fig_kws={"dpi": 200},
)
plt.show()
Preprocessing scan sets: 100%|██████████| 16/16 [00:03<00:00,  4.09it/s]
../../_images/tutorials_notebooks_tutorial_2_26_1.png
[16]:
routine_results_df = pyeem.preprocessing.perform_routine(
    dataset,
    routine_df,
    raman_source_type = "water_raman",
    fill = None,
    truncate = "both",
    progress_bar=True
)

axes = pyeem.plots.preprocessing_routine_plot(
    dataset,
    routine_results_df,
    sample_set=sample_set,
    sample_name=sample_name,
    plot_type="contour",
    fig_kws={"dpi": 200},
)
plt.show()
Preprocessing scan sets: 100%|██████████| 16/16 [00:03<00:00,  4.21it/s]
../../_images/tutorials_notebooks_tutorial_2_27_1.png
[17]:
routine_results_df = pyeem.preprocessing.perform_routine(
    dataset,
    routine_df,
    raman_source_type = "water_raman",
    fill = None,
    band="rayleigh",
    order="first",
    excision_width=20,
    progress_bar=True
)

axes = pyeem.plots.preprocessing_routine_plot(
    dataset,
    routine_results_df,
    sample_set=sample_set,
    sample_name=sample_name,
    plot_type="contour",
    fig_kws={"dpi": 200},
)
plt.show()
Preprocessing scan sets: 100%|██████████| 16/16 [00:03<00:00,  4.09it/s]
../../_images/tutorials_notebooks_tutorial_2_28_1.png
[18]:
routine_results_df = pyeem.preprocessing.perform_routine(
    dataset,
    routine_df,
    raman_source_type = "water_raman",
    fill = "interp",
    band="both",
    excision_width = 25,
    progress_bar=True
)

axes = pyeem.plots.preprocessing_routine_plot(
    dataset,
    routine_results_df,
    sample_set=sample_set,
    sample_name=sample_name,
    plot_type="imshow",
)
plt.show()
Preprocessing scan sets: 100%|██████████| 16/16 [00:08<00:00,  1.97it/s]
../../_images/tutorials_notebooks_tutorial_2_29_1.png
[ ]:
routine_results_df = pyeem.preprocessing.perform_routine(
    dataset,
    routine_df,
    raman_source_type = "water_raman",
    fill = None,
    band="rayleigh",
    order="first",
    truncate="below",
    excision_width = 25,
    progress_bar=True
)

axes = pyeem.plots.preprocessing_routine_plot(
    dataset,
    routine_results_df,
    sample_set=sample_set,
    sample_name=sample_name,
    plot_type="imshow",
)
plt.show()