OCEAN ICE Data Catalogue

OCEAN ICE Data Catalogue#

For an interactive version of this page please visit the Google Colab:
Open in Google Colab

_{(To open link in new tab press Ctrl + click)}

Alternatively this notebook can be opened with Binder by following the link: OCEAN ICE Data Catalogue

Purpose

This notebook builds an interactive catalog of datasets related to the OCEAN ICE project. It allows users to:

Browse datasets available on the OCEAN ICE ERDDAP server and Zenodo repositories.
Display key metadata (title, description, creators, funders, license, DOI, coverage).
Interactively select variables and temporal ranges.
Generate direct download links for the selected subsets in CSV format.

This catalog provides a single access point for exploring OCEAN ICE observational and modeling datasets, making discovery, metadata inspection and data download.

Data sources

The sources are:

Zenodo OCEAN ICE Community: A curated list of Zenodo DOIs is queried via the Zenodo API to retrieve metadata (title, description, authors, funders, license, DOI and citation).
OCEAN ICE ERDDAP server: Metadata from the ERDDAP endpoint allDatasets is parsed to get dataset titles, structures and metadata links. These ERDDAP datasets include gridded (griddap) and tabular (tabledap) collections with rich metadata (spatial/temporal coverage, variables, licensing).

Instructions to use this Notebook

To interact with the notebook, run each code cell sequentially, You can do this by clicking the Play button (▶️) on the left side of each grey code block. Executing the cells in order ensure that all features and visualizations work properly.

Explaining the code

1. Import required libraries & define data sources

This section loads Python libraries:

requests, re – HTTP requests & string cleaning.
pandas – handle metadata tables.
ipywidgets – build dropdowns, checkboxes, date pickers, and buttons.
IPython.display – render widgets, tables, and links inline.
datetime – manage time coverage metadata.

and defines the sources:

*a list of Zenodo IDs relevant to OCEAN ICE.

ERDDAP API URLs (allDatasets, metadata, and base access).
the fields of interest to be displayed for each dataset.

# @title
import requests
import pandas as pd
import re
from ipywidgets import (
    Dropdown,
    Checkbox,
    VBox,
    GridspecLayout,
    Checkbox,
    Button,
    DatePicker,
    Label
)
from IPython.display import (
    display,
    Javascript
)
from datetime import (
    datetime,
    timedelta
)

zenodo_ids = [
    '15747365',
    '15590997',
    '15189061',
    '15267996',
    '15268272',
    '15268317',
    '15299425',
    '15299650',
    '15299705',
    '15280675',
    '15181349',
    '14162776',
    '14041098',
    '11652686',
    '11096059',
    '11652686',
    '12581210',
    '11096232',
    '14193092'
]

ZENODO_URL = 'https://zenodo.org/api/records/'
ALL_DATASET_URL = 'https://er1.s4oceanice.eu/erddap/tabledap/allDatasets.csv?datasetID%2Ctitle%2CdataStructure%2Cmetadata'
BASE_URL = 'https://er1.s4oceanice.eu/erddap/'

fields_of_interest = [
                      'title',
                      'summary',
                      'conventions',
                      'creator_name',
                      'creator_type',
                      'creator_url',
                      'institution',
                      'project',
                      'project_url',
                      'infoUrl',
                      'license',
                      'citation',
                      'funding',
                      'doi',
                      'time_coverage_end',
                      'time_coverage_start',
                      'geospatial_lat_min',
                      'geospatial_lat_max',
                      'geospatial_lon_min',
                      'geospatial_lon_max']

zenodo_fields = [
                  'title',
                  'description',
                  'creators',
                  'funder',
                  'doi',
                  'license',
                  'citation',
                 ]

2. Retrieve and parse Zenodo metadata

In this step, each Zenodo ID is processed to build a structured metadata record::

the Zenodo API is queried to fetch the dataset’s metadata.
from the response, key fields are extracted, including the dataset’s title, description, list of creators, funding sources, DOI and license.
since the descriptions often contain HTML tags, these are cleaned out to make the text more readable.
a properly formatted citation string is then created by combining the author names, dataset title, and DOI.
finally, all the cleaned and structured information is stored in a Pandas DataFrame, which makes it easy to explore or use in later parts of the notebook.

# @title
zenodo_data = []

for zenodo_id in zenodo_ids:
    url = f'{ZENODO_URL}{zenodo_id}'
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)
        data = response.json()
        metadata = data.get('metadata', {})
        links = data.get('links', {}) # Get the links dictionary

        creators = [creator.get('name') for creator in metadata.get('creators', []) if creator.get('name')]
        # Corrected extraction of funder names
        funder = [grant.get('funder', {}).get('name') for grant in metadata.get('grants', []) if grant.get('funder', {}).get('name')]


        # Construct citation
        citation_parts = []
        if creators:
            citation_parts.append(", ".join(creators))
        if metadata.get('title'):
            citation_parts.append(metadata.get('title'))
        if metadata.get('doi'):
            citation_parts.append(f"DOI: {metadata.get('doi')}")

        citation = ". ".join(citation_parts) if citation_parts else None

        # Clean HTML from description
        description = metadata.get('description')
        if description:
            clean = re.compile('<.*?>')
            description = re.sub(clean, '', description)


        record = {
            'title': metadata.get('title'),
            'description': description, # Use the cleaned description
            'creators': creators,
            'funder': funder,
            'doi': metadata.get('doi'),
            'license': metadata.get('license').get('id'),
            'citation': citation,
            'self_html': links.get('self_html') # Add self_html link
        }
        zenodo_data.append(record)

    except requests.exceptions.RequestException as e:
        print(f"Error fetching data for Zenodo ID {zenodo_id}: {e}")
    except Exception as e:
        print(f"An unexpected error occurred for Zenodo ID {zenodo_id}: {e}")


zenodo_df = pd.DataFrame(zenodo_data)

2. Fetching ERDDAP Dataset Catalog

Here, the notebook retrieves a list of all datasets available through the ERDDAP server:

the allDatasets.csv endpoint is queried from ERDDAP.
the dataset titles are extracted and combined with those retrieved from Zenodo.
a dropdown menu is created, allowing the user to select from the combined list of Zenodo and ERDDAP datasets.
this selection serves as the entry point for exploring dataset metadata in the next steps.

# @title
try:
  df = pd.read_csv(ALL_DATASET_URL)
  #display(df)
except Exception as e:
  print('ERROR: ', e)

# Check if both df and zenodo_df exist and have a 'title' column
if 'df' in globals() and df is not None and 'title' in df.columns and 'zenodo_df' in globals() and zenodo_df is not None and 'title' in zenodo_df.columns:
    # Combine titles from both dataframes
    erddap_titles = df['title'].dropna().tolist()
    zenodo_titles = zenodo_df['title'].dropna().tolist()
    all_titles = erddap_titles + zenodo_titles

    options = all_titles

    # Set the initial value of the dropdown only if options is not empty
    initial_value = options[0] if options else None

    dropdown = Dropdown(
        options=options,
        description='Dataset:',
        value=initial_value
    )

    # Display the dropdown and the metadata table together
    # The metadata will be loaded and displayed when a selection is made

else:
    print("DataFrames or the 'title' column are not available.")
    # Create an empty dropdown or display a message if dataframes are not available
    options = ["No datasets available"]
    # Set the initial value of the dropdown only if options is not empty
    initial_value = options[0] if options else None
    dropdown = Dropdown(
        options=options,
        description='Dataset:',
        value=initial_value
    )


display(dropdown)

3. Loading metadata for the selected dataset

Once a dataset is chosen from the dropdown, its metadata is loaded and structured:

if the dataset comes from Zenodo, the notebook extracts the relevant fields directly from the Zenodo metadata table.
if the dataset comes from ERDDAP, the corresponding metadata .csv is downloaded and parsed.
key attributes (e.g., title, project, institution, geospatial and temporal coverage) are extracted and stored in a DataFrame.

This ensures that both Zenodo and ERDDAP datasets can be handled in a unified way, even though their metadata formats differ.

Note: Run this code everytime the selection of the dataset changes.

# @title
def load_selected_dataset(change):
    global new_df
    global metadata_df
    global zenodo_html_link # Make the link available globally
    selected_dataset_id = change['new']
    zenodo_html_link = None # Reset the link

    # Check if the selected dataset is in the zenodo_df
    is_zenodo_dataset = selected_dataset_id in zenodo_df['title'].values

    if is_zenodo_dataset:
        # If it's a Zenodo dataset, extract information from zenodo_df
        zenodo_record = zenodo_df[zenodo_df['title'] == selected_dataset_id].iloc[0]
        metadata_to_display = {}
        for field in zenodo_fields:
            if field in zenodo_record.index:
                metadata_to_display[field] = zenodo_record[field]

        # Store the self_html link globally for the button
        if 'self_html' in zenodo_record.index:
            zenodo_html_link = zenodo_record['self_html']


        if metadata_to_display:
            metadata_df = pd.DataFrame.from_dict(metadata_to_display, orient='index', columns=['Value'])
            metadata_df.index.name = 'Attribute'
            # Since Zenodo data is already loaded, set new_df to None or an empty DataFrame
            # as there's no separate CSV to load in this flow.
            new_df = pd.DataFrame() # Or None, depending on how new_df is used later
        else:
             print(f"Could not find metadata for Zenodo dataset: {selected_dataset_id}")

    else:
        # If not a Zenodo dataset, assume it's an ERDDAP dataset and proceed as before
        try:
            metadata_url = df[df['title'] == selected_dataset_id]['metadata'].iloc[0]
            csv_url = metadata_url + '.csv'

            new_df = pd.read_csv(csv_url)

            # Extract and display metadata
            metadata_to_display = {}

            for field in fields_of_interest:
                if field in new_df['Attribute Name'].values:
                    metadata_to_display[field] = new_df[new_df['Attribute Name'] == field]['Value'].iloc[0]

            if metadata_to_display:
                metadata_df = pd.DataFrame.from_dict(metadata_to_display, orient='index', columns=['Value'])
                metadata_df.index.name = 'Attribute'
                #print(f"Successfully loaded data and metadata for ERDDAP dataset: {selected_dataset_id}")


        except Exception as e:
            print(f"ERROR loading data for dataset: {selected_dataset_id}")
            print(e)


dropdown.observe(load_selected_dataset, names='value')

if dropdown.value:
    load_selected_dataset({'new': dropdown.value})

4. Building interactive widgets for exploration

Here, the notebook creates interactive tools to refine what part of the dataset to explore:

variable checkboxes are generated, listing all available variables for the selected dataset.
date pickers are created, based on the dataset’s reported start and end dates, so the user can filter by time range.
if the dataset is from Zenodo, a button is also added that links directly to the Zenodo landing page.

Together, these widgets let the user choose variables, restrict time periods and explore metadata interactively.

Note: Run this code everytime the selection of the dataset changes.

# @title
def create_variable_checkboxes(df):
    """Creates checkboxes for variables in the DataFrame and arranges them in a grid."""
    if df is None or df.empty or 'Row Type' not in df.columns or 'Variable Name' not in df.columns:
        print("DataFrame is not valid or missing required columns.")
        return None

    variable_names = df[df['Row Type'] == 'variable']['Variable Name'].dropna().unique().tolist()

    if not variable_names:
        print("No variables found in the DataFrame.")
        return None

    num_variables = len(variable_names)
    num_cols = 4
    num_rows = (num_variables + num_cols - 1) // num_cols

    grid = GridspecLayout(num_rows, num_cols)

    for i, var_name in enumerate(variable_names):
        row = i // num_cols
        col = i % num_cols
        grid[row, col] = Checkbox(description=var_name, value=False)

    return grid

def create_time_select(df):
    """Creates date pickers based on time_coverage_start and time_coverage_end."""
    start_date_str = df[df['Attribute Name'] == 'time_coverage_start']['Value'].iloc[0] if 'time_coverage_start' in df['Attribute Name'].values else None
    end_date_str = df[df['Attribute Name'] == 'time_coverage_end']['Value'].iloc[0] if 'time_coverage_end' in df['Attribute Name'].values else None

    if start_date_str and end_date_str:
        try:
            start_date = datetime.fromisoformat(start_date_str.replace('Z', '+00:00')).date()
            end_date = datetime.fromisoformat(end_date_str.replace('Z', '+00:00')).date()

            start_date_picker = DatePicker(
                description='Start Date:',
                value=start_date,
                min=start_date,
                max=end_date,
                disabled=False
            )
            end_date_picker = DatePicker(
                description='End Date:',
                value=end_date,
                min=start_date,
                max=end_date,
                disabled=False
            )

            return VBox([start_date_picker, end_date_picker])

        except Exception as e:
            print(f"Error creating time select: {e}")
            return None
    else:
        print("time_coverage_start or time_coverage_end not found in metadata.")
        return None

def create_zenodo_link_button(url):
    """Creates a button that opens the given URL in a new tab when clicked."""
    button = Button(description="View on Zenodo")

    def on_button_click(b):
        display(Javascript(f'window.open("{url}");'))

    button.on_click(on_button_click)
    return button


if 'new_df' in globals() and new_df is not None and not new_df.empty:
    checkbox_grid = create_variable_checkboxes(new_df)
    time_select_widget = create_time_select(new_df)

    if 'metadata_df' in globals() and metadata_df is not None:
        print("Metadata:")
        display(metadata_df)


    if checkbox_grid and time_select_widget:
        display(VBox([Label(""), checkbox_grid, Label(""), time_select_widget]))
    elif checkbox_grid:
        display(VBox([Label(""), checkbox_grid]))
    elif time_select_widget:
        display(VBox([Label(""), time_select_widget]))
    else:
        print("No widgets to display.")
elif 'metadata_df' in globals() and metadata_df is not None and not metadata_df.empty:
    print("Metadata:")
    display(metadata_df)
    # Add the Zenodo link button if the link is available
    if 'zenodo_html_link' in globals() and zenodo_html_link:
        zenodo_button = create_zenodo_link_button(zenodo_html_link)
        display(zenodo_button)
else:
  print("Please select a Dataset from the dropdown menu")

time_coverage_start or time_coverage_end not found in metadata.
Metadata:

	Value
Attribute
title	* The List of All Active Datasets in this ERDD...
summary	This dataset is a table which has a row of inf...
creator_name	ETT Ricerca
creator_url	https://er1.s4oceanice.eu/erddap
institution	ETT S.p.A. - People and Technology
infoUrl	https://er1.s4oceanice.eu/erddap
license	The data may be used and redistributed for fre...

5. Generating download links for ERDDAP datasets

Note: Run this cell only after selecting an ERDDAP dataset if you want to download the corresponding data in .csv format. If you change your dataset, variables, or time range, make sure to re-run this cell to update the download link.

Finally, the notebook enables downloading filtered data directly from ERDDAP:

based on the dataset selection, the chosen variables and any date filters, a query URL is constructed.
the query is checked against the server to confirm whether valid data is available.
if the data exists, a Download button is displayed, opening the dataset in .csv format.
if no data is available for the given variables or dates, the user receives a clear error message.

# @title
def generate_download_url(dropdown_widget, checkbox_grid, df, base_url, time_select_widget=None):
    """Generates the download URL based on dropdown and checkbox selections."""
    selected_dataset_title = dropdown_widget.value
    if not selected_dataset_title:
        return "Please select a dataset."

    dataset_info = df[df['title'] == selected_dataset_title]
    if dataset_info.empty:
        return f"Could not find information for dataset: {selected_dataset_title}"

    selected_dataset_id = dataset_info['datasetID'].iloc[0] # Get the datasetID

    data_structure = dataset_info['dataStructure'].iloc[0]
    if data_structure == 'table':
        dap_type = 'tabledap'
    elif data_structure == 'grid':
        dap_type = 'griddap'
    else:
        return f"Unknown data structure: {data_structure}"

    selected_variables = []
    if checkbox_grid:
        # Check if checkbox_grid is a GridspecLayout or a single Checkbox
        if isinstance(checkbox_grid, GridspecLayout):
            for row in checkbox_grid.children:
                if isinstance(row, Checkbox) and row.value:
                    selected_variables.append(row.description)
        elif isinstance(checkbox_grid, Checkbox) and checkbox_grid.value:
             selected_variables.append(checkbox_grid.description)


    if not selected_variables:
        return "Please select at least one variable."

    variables_string = "%2C".join(selected_variables)

    url = f"{base_url}{dap_type}/{selected_dataset_id}.csv?{variables_string}"


    # Add time constraints if time_select_widget is available
    if time_select_widget and isinstance(time_select_widget, VBox):
        start_date_picker = time_select_widget.children[0]
        end_date_picker = time_select_widget.children[1]
        start_date = start_date_picker.value
        end_date = end_date_picker.value

        if start_date and end_date:
            # Format dates as required by ERDDAP (usually ISO 8601) without the 'Z'
            start_date_str = start_date.isoformat()
            end_date_str = end_date.isoformat()
            url += f"&time>={start_date_str}&time<={end_date_str}"

    return url

# Assuming download_url is the URL generated from the previous step
if 'checkbox_grid' in globals() and checkbox_grid is not None:
    download_url = generate_download_url(dropdown, checkbox_grid, df, BASE_URL, time_select_widget)

    def create_download_button(url):
        """Creates a button that opens the given URL in a new tab when clicked."""
        button = Button(description="Download Data")

        def on_button_click(b):
            display(Javascript(f'window.open("{url}");'))

        button.on_click(on_button_click)
        return button

    # Assuming download_url is the URL generated from the previous step
    if download_url:
        # Check if the generated download_url is an error message or a URL
        if download_url.startswith("http://") or download_url.startswith("https://"):
            # It's a URL, now check if it returns a 404
            try:
                response = requests.head(download_url)
                if response.status_code == 404:
                    print("Error: Data not found for the selected variables and time range. Please select other or more variables.")
                else:
                    download_button = create_download_button(download_url)
                    display(download_button)
            except requests.exceptions.RequestException as e:
                print(f"Error checking URL: {e}")
        else:
            # If it's not a URL, it's likely the "Please select at least one variable." message
            print(download_url)
else:
  print("Program is waiting for dataset selection from the dropdown menu.")

Please select at least one variable.