{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "BgJs6pCvKM-W"
},
"source": [
"# **OCEAN ICE Data Catalogue** #"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For an interactive version of this page please visit the Google Colab: \n",
"[Open in Google Colab](https://colab.research.google.com/drive/1SxGbDLXVHGNMr5m-fgPJDEw_Vg9J6ZhB)\n",
"\n",
"(To open link in new tab press Ctrl + click)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively this notebook can be opened with Binder by following the link:\n",
"[OCEAN ICE Data Catalogue](https://mybinder.org/v2/gh/s4oceanice/literacy.s4oceanice/main?urlpath=%2Fdoc%2Ftree%2Fnotebooks_binder%2Foceanice_catalogue.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KpT-fwm6gKv2"
},
"source": [
"**Purpose**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wn20YT_xgODt"
},
"source": [
"This notebook builds an interactive catalog of datasets related to the OCEAN ICE project.\n",
"It allows users to:\n",
"\n",
"* Browse datasets available on the **OCEAN ICE ERDDAP server** and in the **Zenodo repositories**.\n",
"\n",
"* Display key metadata (title, description, creators, funders, license, DOI, coverage).\n",
"\n",
"* Interactively select variables and temporal ranges.\n",
"\n",
"* Generate direct download links for the selected subsets in CSV format.\n",
"\n",
"This catalog provides a single access point for exploring OCEAN ICE observational and modeling datasets, bringing discovery, metadata inspection and data download together in one place."
]
},
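{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the download links mentioned above, the sketch below assembles a CSV request for a tabular (tabledap) dataset following the usual ERDDAP pattern `datasetID.csv?variables&constraints`. The helper `build_tabledap_csv_url`, the dataset ID and the variable names are hypothetical placeholders, and gridded (griddap) datasets use a different query syntax that is not covered here.\n",
"\n",
"```python\n",
"from urllib.parse import quote\n",
"\n",
"# Server root as used in this notebook; the dataset ID below is a placeholder\n",
"BASE_URL = 'https://er1.s4oceanice.eu/erddap/'\n",
"\n",
"\n",
"def build_tabledap_csv_url(dataset_id, variables, start=None, end=None, base_url=BASE_URL):\n",
"    # Comma-separated variable list, percent-encoded\n",
"    query = quote(','.join(variables), safe='')\n",
"    # Optional time constraints are appended as &time>=start and &time<=end\n",
"    if start:\n",
"        query += '&time' + quote('>=' + start, safe='')\n",
"    if end:\n",
"        query += '&time' + quote('<=' + end, safe='')\n",
"    return f'{base_url}tabledap/{dataset_id}.csv?{query}'\n",
"\n",
"\n",
"url = build_tabledap_csv_url(\n",
"    'example_dataset',          # hypothetical dataset ID\n",
"    ['time', 'temperature'],    # hypothetical variable names\n",
"    start='2020-01-01T00:00:00Z',\n",
"    end='2020-12-31T23:59:59Z',\n",
")\n",
"print(url)\n",
"```\n",
"\n",
"Percent-encoding the comma and the comparison operators mirrors the encoding already used in `ALL_DATASET_URL`; the resulting link can be opened in a browser or passed to `pd.read_csv`."
]
},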
{
"cell_type": "markdown",
"metadata": {
"id": "Sd8ROkeDgdpm"
},
"source": [
"**Data sources**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3luXOGQEghIf"
},
"source": [
"The sources are:\n",
"* **Zenodo OCEAN ICE Community**:\n",
"A curated list of Zenodo DOIs is queried via the Zenodo API to retrieve metadata (title, description, authors, funders, license, DOI and citation).\n",
"\n",
"* **OCEAN ICE ERDDAP server**:\n",
"Metadata from the ERDDAP endpoint `allDatasets` is parsed to get dataset titles, structures and metadata links.\n",
"These ERDDAP datasets include gridded (griddap) and tabular (tabledap) collections with rich metadata (spatial/temporal coverage, variables, licensing)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BMALrzcJhRdo"
},
"source": [
"**Instructions to use this Notebook**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OV4_p5A5hTgr"
},
"source": [
"To interact with the notebook, run each code cell sequentially. You can do this by clicking the **Play button** (▶️) on the left side of each grey code block. Executing the cells in order ensures that all features and visualizations work properly."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mFK3L0afhd4y"
},
"source": [
"**Explaining the code**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "79QsK2VFhgFL"
},
"source": [
"**1. Import required libraries & define data sources**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UIq9hgjVhn0E"
},
"source": [
"This section loads Python libraries:\n",
"\n",
"* requests, re – HTTP requests & string cleaning.\n",
"\n",
"* pandas – handle metadata tables.\n",
"\n",
"* ipywidgets – build dropdowns, checkboxes, date pickers, and buttons.\n",
"\n",
"* IPython.display – render widgets, tables, and links inline.\n",
"\n",
"* datetime – manage time coverage metadata.\n",
"\n",
"and defines the sources:\n",
"\n",
"* a list of Zenodo IDs relevant to OCEAN ICE.\n",
"\n",
"* ERDDAP API URLs (allDatasets, metadata, and base access).\n",
"\n",
"* the fields of interest to be displayed for each dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "4o6l7XRLo3SZ"
},
"outputs": [],
"source": [
"# @title\n",
"import requests\n",
"import pandas as pd\n",
"import re\n",
"from ipywidgets import (\n",
"    Dropdown,\n",
"    Checkbox,\n",
"    VBox,\n",
"    GridspecLayout,\n",
"    Button,\n",
"    DatePicker,\n",
"    Label\n",
")\n",
"from IPython.display import (\n",
"    display,\n",
"    Javascript\n",
")\n",
"from datetime import (\n",
"    datetime,\n",
"    timedelta\n",
")\n",
"\n",
"# Zenodo record IDs of the OCEAN ICE datasets (each ID listed once)\n",
"zenodo_ids = [\n",
"    '15747365',\n",
"    '15590997',\n",
"    '15189061',\n",
"    '15267996',\n",
"    '15268272',\n",
"    '15268317',\n",
"    '15299425',\n",
"    '15299650',\n",
"    '15299705',\n",
"    '15280675',\n",
"    '15181349',\n",
"    '14162776',\n",
"    '14041098',\n",
"    '11652686',\n",
"    '11096059',\n",
"    '12581210',\n",
"    '11096232',\n",
"    '14193092'\n",
"]\n",
"\n",
"ZENODO_URL = 'https://zenodo.org/api/records/'\n",
"ALL_DATASET_URL = 'https://er1.s4oceanice.eu/erddap/tabledap/allDatasets.csv?datasetID%2Ctitle%2CdataStructure%2Cmetadata'\n",
"BASE_URL = 'https://er1.s4oceanice.eu/erddap/'\n",
"\n",
"fields_of_interest = [\n",
"    'title',\n",
"    'summary',\n",
"    'conventions',\n",
"    'creator_name',\n",
"    'creator_type',\n",
"    'creator_url',\n",
"    'institution',\n",
"    'project',\n",
"    'project_url',\n",
"    'infoUrl',\n",
"    'license',\n",
"    'citation',\n",
"    'funding',\n",
"    'doi',\n",
"    'time_coverage_end',\n",
"    'time_coverage_start',\n",
"    'geospatial_lat_min',\n",
"    'geospatial_lat_max',\n",
"    'geospatial_lon_min',\n",
"    'geospatial_lon_max'\n",
"]\n",
"\n",
"zenodo_fields = [\n",
"    'title',\n",
"    'description',\n",
"    'creators',\n",
"    'funder',\n",
"    'doi',\n",
"    'license',\n",
"    'citation',\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tXDnyuXniDd_"
},
"source": [
"**2. Retrieve and parse Zenodo metadata**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aKgJpyzuiG-3"
},
"source": [
"In this step, each **Zenodo ID** is processed to build a structured metadata record:\n",
"\n",
"* the **Zenodo API** is queried to fetch the dataset’s metadata.\n",
"\n",
"* from the response, key fields are extracted, including the dataset’s *title, description, list of creators, funding sources, DOI* and *license*.\n",
"\n",
"* since the descriptions often contain HTML tags, these are cleaned out to make the text more readable.\n",
"\n",
"* a properly formatted **citation string** is then created by combining the author names, dataset title, and DOI.\n",
"\n",
"* finally, all the cleaned and structured information is stored in a **Pandas DataFrame**, which makes it easy to explore or use in later parts of the notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "n-tAKn01ytfP"
},
"outputs": [],
"source": [
"# @title\n",
"zenodo_data = []\n",
"\n",
"for zenodo_id in zenodo_ids:\n",
"    url = f'{ZENODO_URL}{zenodo_id}'\n",
"    try:\n",
"        # A timeout keeps the cell from hanging if Zenodo does not respond\n",
"        response = requests.get(url, timeout=30)\n",
"        response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)\n",
"        data = response.json()\n",
"        metadata = data.get('metadata', {})\n",
"        links = data.get('links', {})  # Get the links dictionary\n",
"\n",
"        creators = [creator.get('name') for creator in metadata.get('creators', []) if creator.get('name')]\n",
"        # Funder names are nested inside each grant entry\n",
"        funder = [grant.get('funder', {}).get('name') for grant in metadata.get('grants', []) if grant.get('funder', {}).get('name')]\n",
"\n",
"        # Construct citation\n",
"        citation_parts = []\n",
"        if creators:\n",
"            citation_parts.append(\", \".join(creators))\n",
"        if metadata.get('title'):\n",
"            citation_parts.append(metadata.get('title'))\n",
"        if metadata.get('doi'):\n",
"            citation_parts.append(f\"DOI: {metadata.get('doi')}\")\n",
"\n",
"        citation = \". \".join(citation_parts) if citation_parts else None\n",
"\n",
"        # Clean HTML from description\n",
"        description = metadata.get('description')\n",
"        if description:\n",
"            description = re.sub(r'<.*?>', '', description)\n",
"\n",
"        record = {\n",
"            'title': metadata.get('title'),\n",
"            'description': description,  # Use the cleaned description\n",
"            'creators': creators,\n",
"            'funder': funder,\n",
"            'doi': metadata.get('doi'),\n",
"            # The license entry may be missing; fall back to None instead of raising\n",
"            'license': (metadata.get('license') or {}).get('id'),\n",
"            'citation': citation,\n",
"            'self_html': links.get('self_html')  # Add self_html link\n",
"        }\n",
"        zenodo_data.append(record)\n",
"\n",
"    except requests.exceptions.RequestException as e:\n",
"        print(f\"Error fetching data for Zenodo ID {zenodo_id}: {e}\")\n",
"    except Exception as e:\n",
"        print(f\"An unexpected error occurred for Zenodo ID {zenodo_id}: {e}\")\n",
"\n",
"\n",
"zenodo_df = pd.DataFrame(zenodo_data)"
]
},
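{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cleaning and citation steps described in this section can also be tried offline on a hand-written record. The sample dictionary below is invented for illustration and only mimics the keys this notebook reads from the Zenodo response; `summarise_record` is a hypothetical helper, not part of the notebook.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Invented record that mimics the keys read from a Zenodo API response\n",
"sample = {\n",
"    'metadata': {\n",
"        'title': 'Example dataset',\n",
"        'description': '<p>Model output for <b>testing</b>.</p>',\n",
"        'creators': [{'name': 'Doe, Jane'}, {'name': 'Roe, Richard'}],\n",
"        'doi': '10.5281/zenodo.0000000',  # placeholder DOI\n",
"    }\n",
"}\n",
"\n",
"\n",
"def summarise_record(data):\n",
"    # Strip HTML tags and build an authors / title / DOI citation string\n",
"    metadata = data.get('metadata', {})\n",
"    creators = [c['name'] for c in metadata.get('creators', []) if c.get('name')]\n",
"    description = re.sub(r'<.*?>', '', metadata.get('description') or '')\n",
"    parts = []\n",
"    if creators:\n",
"        parts.append(', '.join(creators))\n",
"    if metadata.get('title'):\n",
"        parts.append(metadata['title'])\n",
"    if metadata.get('doi'):\n",
"        parts.append('DOI: ' + metadata['doi'])\n",
"    return {\n",
"        'description': description,\n",
"        # A missing license yields None rather than an error\n",
"        'license': (metadata.get('license') or {}).get('id'),\n",
"        'citation': '. '.join(parts) if parts else None,\n",
"    }\n",
"\n",
"\n",
"print(summarise_record(sample))\n",
"```\n",
"\n",
"Keeping the parsing in a small function like this makes it possible to check the output format without contacting the Zenodo API."
]
},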
{
"cell_type": "markdown",
"metadata": {
"id": "ZjY73aBsjM9R"
},
"source": [
"**2. Fetching ERDDAP Dataset Catalog**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RGn1fupyjRJh"
},
"source": [
"Here, the notebook retrieves a list of all datasets available through the ERDDAP server:\n",
"\n",
"* the **allDatasets.csv** endpoint is queried from ERDDAP.\n",
"\n",
"* the dataset titles are extracted and combined with those retrieved from Zenodo.\n",
"\n",
"* a dropdown menu is created, allowing the user to select from the combined list of Zenodo and ERDDAP datasets.\n",
"\n",
"* this selection serves as the entry point for exploring dataset metadata in the next steps."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 49,
"referenced_widgets": [
"6d221a8a50444b3bba4485d8cacff85c",
"724c285085aa479faf60ff84f983fe6f",
"3f30fb46cf1842bc8a13de2acc8a07cd"
]
},
"id": "c487ee6f",
"outputId": "8baa023f-327c-4f03-82d0-154426f9653e"
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "6d221a8a50444b3bba4485d8cacff85c",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Dropdown(description='Dataset:', options=('* The List of All Active Datasets in this ERDDAP *', 'AAD - ASPeCt-…"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# @title\n",
"try:\n",
"    df = pd.read_csv(ALL_DATASET_URL)\n",
"    #display(df)\n",
"except Exception as e:\n",
"    print('ERROR: ', e)\n",
"\n",
"# Check if both df and zenodo_df exist and have a 'title' column\n",
"if 'df' in globals() and df is not None and 'title' in df.columns and 'zenodo_df' in globals() and zenodo_df is not None and 'title' in zenodo_df.columns:\n",
"    # Combine titles from both dataframes\n",
"    erddap_titles = df['title'].dropna().tolist()\n",
"    zenodo_titles = zenodo_df['title'].dropna().tolist()\n",
"    all_titles = erddap_titles + zenodo_titles\n",
"\n",
"    options = all_titles\n",
"\n",
"    # Set the initial value of the dropdown only if options is not empty\n",
"    initial_value = options[0] if options else None\n",
"\n",
"    dropdown = Dropdown(\n",
"        options=options,\n",
"        description='Dataset:',\n",
"        value=initial_value\n",
"    )\n",
"\n",
"    # Display the dropdown and the metadata table together\n",
"    # The metadata will be loaded and displayed when a selection is made\n",
"\n",
"else:\n",
"    print(\"DataFrames or the 'title' column are not available.\")\n",
"    # Create an empty dropdown or display a message if dataframes are not available\n",
"    options = [\"No datasets available\"]\n",
"    # Set the initial value of the dropdown only if options is not empty\n",
"    initial_value = options[0] if options else None\n",
"    dropdown = Dropdown(\n",
"        options=options,\n",
"        description='Dataset:',\n",
"        value=initial_value\n",
"    )\n",
"\n",
"\n",
"display(dropdown)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xVqJ7EaYnGl6"
},
"source": [
"**3. Loading metadata for the selected dataset**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jjYUX1Meku-5"
},
"source": [
"Once a dataset is chosen from the dropdown, its metadata is loaded and structured:\n",
"\n",
"* if the dataset comes from **Zenodo**, the notebook extracts the relevant fields directly from the Zenodo metadata table.\n",
"\n",
"* if the dataset comes from **ERDDAP**, the corresponding metadata `.csv` is downloaded and parsed.\n",
"\n",
"* key attributes (e.g., title, project, institution, geospatial and temporal coverage) are extracted and stored in a DataFrame.\n",
"\n",
"This ensures that both Zenodo and ERDDAP datasets can be handled in a unified way, even though their metadata formats differ.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "f3VoJ4n6m03R"
},
"source": [
"**Note**: Run this code cell every time the dataset selection changes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "7e115aa3"
},
"outputs": [],
"source": [
"# @title\n",
"def load_selected_dataset(change):\n",
"    global new_df\n",
"    global metadata_df\n",
"    global zenodo_html_link  # Make the link available globally\n",
"    selected_dataset_id = change['new']\n",
"    zenodo_html_link = None  # Reset the link\n",
"\n",
"    # Check if the selected dataset is in the zenodo_df\n",
"    is_zenodo_dataset = selected_dataset_id in zenodo_df['title'].values\n",
"\n",
"    if is_zenodo_dataset:\n",
"        # If it's a Zenodo dataset, extract information from zenodo_df\n",
"        zenodo_record = zenodo_df[zenodo_df['title'] == selected_dataset_id].iloc[0]\n",
"        metadata_to_display = {}\n",
"        for field in zenodo_fields:\n",
"            if field in zenodo_record.index:\n",
"                metadata_to_display[field] = zenodo_record[field]\n",
"\n",
"        # Store the self_html link globally for the button\n",
"        if 'self_html' in zenodo_record.index:\n",
"            zenodo_html_link = zenodo_record['self_html']\n",
"\n",
"\n",
"        if metadata_to_display:\n",
"            metadata_df = pd.DataFrame.from_dict(metadata_to_display, orient='index', columns=['Value'])\n",
"            metadata_df.index.name = 'Attribute'\n",
"            # Since Zenodo data is already loaded, set new_df to None or an empty DataFrame\n",
"            # as there's no separate CSV to load in this flow.\n",
"            new_df = pd.DataFrame()  # Or None, depending on how new_df is used later\n",
"        else:\n",
"            print(f\"Could not find metadata for Zenodo dataset: {selected_dataset_id}\")\n",
"\n",
"    else:\n",
"        # If not a Zenodo dataset, assume it's an ERDDAP dataset and proceed as before\n",
"        try:\n",
"            metadata_url = df[df['title'] == selected_dataset_id]['metadata'].iloc[0]\n",
"            csv_url = metadata_url + '.csv'\n",
"\n",
"            new_df = pd.read_csv(csv_url)\n",
"\n",
"            # Extract and display metadata\n",
"            metadata_to_display = {}\n",
"\n",
"            for field in fields_of_interest:\n",
"                if field in new_df['Attribute Name'].values:\n",
"                    metadata_to_display[field] = new_df[new_df['Attribute Name'] == field]['Value'].iloc[0]\n",
"\n",
"            if metadata_to_display:\n",
"                metadata_df = pd.DataFrame.from_dict(metadata_to_display, orient='index', columns=['Value'])\n",
"                metadata_df.index.name = 'Attribute'\n",
"                #print(f\"Successfully loaded data and metadata for ERDDAP dataset: {selected_dataset_id}\")\n",
"\n",
"\n",
"        except Exception as e:\n",
"            print(f\"ERROR loading data for dataset: {selected_dataset_id}\")\n",
"            print(e)\n",
"\n",
"\n",
"dropdown.observe(load_selected_dataset, names='value')\n",
"\n",
"if dropdown.value:\n",
"    load_selected_dataset({'new': dropdown.value})"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BArZv4AqnQGx"
},
"source": [
"**4. Building interactive widgets for exploration**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Uxc2hiRKnUyJ"
},
"source": [
"Here, the notebook creates interactive tools to refine what part of the dataset to explore:\n",
"\n",
"* **variable checkboxes** are generated, listing all available variables for the selected dataset.\n",
"\n",
"* **date pickers** are created, based on the dataset’s reported start and end dates, so the user can filter by time range.\n",
"\n",
"* if the dataset is from **Zenodo**, a button is also added that links directly to the Zenodo landing page.\n",
"\n",
"Together, these widgets let the user choose variables, restrict time periods and explore metadata interactively."
]
},
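{
"cell_type": "markdown",
"metadata": {},
"source": [
"How the variable list and the date range might be derived from an ERDDAP info table can be sketched offline. The CSV text below is a made-up sample: it uses the `Attribute Name` and `Value` columns this notebook relies on, and assumes the `Row Type` and `Variable Name` columns that ERDDAP info tables normally carry. It uses only the standard library so it runs without the notebook state, and the actual widget code may differ.\n",
"\n",
"```python\n",
"import csv\n",
"import io\n",
"from datetime import datetime\n",
"\n",
"# Made-up sample in the layout of an ERDDAP info table\n",
"SAMPLE_INFO = '''Row Type,Variable Name,Attribute Name,Data Type,Value\n",
"attribute,NC_GLOBAL,time_coverage_start,String,2019-01-05T00:00:00Z\n",
"attribute,NC_GLOBAL,time_coverage_end,String,2020-03-01T12:00:00Z\n",
"variable,time,,double,\n",
"variable,temperature,,float,\n",
"attribute,temperature,units,String,degree_C\n",
"'''\n",
"\n",
"rows = list(csv.DictReader(io.StringIO(SAMPLE_INFO)))\n",
"\n",
"# One checkbox candidate per row whose Row Type is 'variable'\n",
"variables = [r['Variable Name'] for r in rows if r['Row Type'] == 'variable']\n",
"\n",
"# Global attributes hold the reported temporal coverage\n",
"global_attrs = {\n",
"    r['Attribute Name']: r['Value']\n",
"    for r in rows\n",
"    if r['Row Type'] == 'attribute' and r['Variable Name'] == 'NC_GLOBAL'\n",
"}\n",
"\n",
"\n",
"def to_date(stamp):\n",
"    # Parse an ISO 8601 UTC timestamp such as 2019-01-05T00:00:00Z into a date\n",
"    return datetime.strptime(stamp, '%Y-%m-%dT%H:%M:%SZ').date()\n",
"\n",
"\n",
"start = to_date(global_attrs['time_coverage_start'])\n",
"end = to_date(global_attrs['time_coverage_end'])\n",
"print(variables, start, end)\n",
"```\n",
"\n",
"The two dates could then serve as the initial `value` of a pair of `DatePicker` widgets, and each variable name as the `description` of a `Checkbox`."
]
},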
{
"cell_type": "markdown",
"metadata": {
"id": "SdvHHoUqnj7S"
},
"source": [
"**Note**: Run this code cell every time the dataset selection changes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 349,
"referenced_widgets": [
"f108cecb95e4483ba16bdfbe16b9ad37",
"d0aed5db18694a18b3bbc48e9e44ecd5",
"90a355920c764adba471b35c54c767fb"
]
},
"id": "1cf425d8",
"outputId": "b1161b3e-0d17-47c8-a6a1-e3d86a3dc926"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Metadata:\n"
]
},
{
"data": {
"application/vnd.google.colaboratory.intrinsic+json": {
"summary": "{\n \"name\": \"metadata_df\",\n \"rows\": 7,\n \"fields\": [\n {\n \"column\": \"Attribute\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"title\",\n \"description\",\n \"license\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Value\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}",
"type": "dataframe",
"variable_name": "metadata_df"
},
"text/html": [
"\n",
"<table>\n",
"  <thead>\n",
"    <tr><th>Attribute</th><th>Value</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>title</th><td>FESOM 2. Bellingshausen and Amundsen Seas Exp...</td></tr>\n",
"    <tr><th>description</th><td>The reference run was plublished in DOI:\\nIn t...</td></tr>\n",
"    <tr><th>creators</th><td>[van Caspel, Mathias, Janout, Markus, Timmerma...</td></tr>\n",
"    <tr><th>funder</th><td>[European Commission]</td></tr>\n",
"    <tr><th>doi</th><td>10.5281/zenodo.15299650</td></tr>\n",
"    <tr><th>license</th><td>cc-by-4.0</td></tr>\n",
"    <tr><th>citation</th><td>van Caspel, Mathias, Janout, Markus, Timmerman...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"