CAB-LAB and ESDC Access Offline

Hello,

I was wondering if there are any plans to update the ESDC THREDDS server with the latest data cubes? I noticed that the datacube there is out of date (i.e. versions 1.0.1_1 and 1.0.1_2), whereas the cube on the ESDL server is version 2.0.0. Is this server access going to be revived, or is there another way to access the data that I’ve missed somewhere in the docs?


For those of you who aren’t aware of the instructions on the CAB-LAB website, I’ve outlined the steps below, along with a code snippet if you’re using Python 3.6+:

  1. The instructions can be found here on the CAB-LAB website, with the links; in particular, the one that says ESDC THREDDS server.
  2. Here is the link to the directory where you can find the data.
  3. You can navigate through the files until you land on, for example, soil moisture - here.
  4. You should be able to read the contents of the datacube.
  5. Unfortunately you can’t just call the convenient xarray function to open everything in the folder, so you have to do a bit of manual labour: work out the actual file names and then collect them in a list, as in the snippet below.
import xarray as xr

base_url = 'http://www.brockmann-consult.de/cablab-thredds/dodsC/datacube-low-res/data/soil_moisture/'
variable = 'soil_moisture'   # example variable
years = range(2001,2003)  # example years (2001 and 2002; the range end is exclusive)

# build the full list of file URLs
files = [f"{base_url}{year}_{variable}.nc" for year in years]

# import the datacube
datacube = xr.open_mfdataset(files, combine='by_coords')
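
The resulting dataset is lazy (dask-backed), so you can slice it before anything substantial is transferred; for example (assuming the variable inside the files is also named soil_moisture):

# nothing is fetched until you call .compute() or .values on the selection
sm_2001 = datacube['soil_moisture'].sel(time=slice('2001-01-01', '2001-12-31'))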

Some other resources:

  • Some xarray tips and tricks - blog post
  • An xarray issue with advice on what to do if a username and password are needed - github

Best,
Emmanuel

Hello Emmanuel,

Thank you for the message. We will discuss updating the THREDDS server. Currently, THREDDS only serves netCDF files, so it cannot serve the ESDC v2.0.0 data cubes, which are in Zarr format.

In the meantime, you could download the data cubes following the instructions in the notebook on https://jupyterhub.earthsystemdatalab.net -> shared-nb -> Python -> Tools -> download_cube_from_object_storage.ipynb

Does this help you for the moment?

Best regards,
Alicja

Hello,

Thank you for the reply!

I was able to find the code you referenced, and I think it’s super straightforward. I guess the big disadvantage is that we have to download the entire cube instead of taking only the bits one would like? I didn’t see anywhere in the script a way to select just a subsection.
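
For what it’s worth, what I had in mind was something like the sketch below: open the Zarr store remotely and slice before downloading anything. The endpoint and bucket path here are placeholders I made up, not values from the docs:

import xarray as xr
import s3fs

# placeholder endpoint and cube path; adjust to the real object storage location
s3 = s3fs.S3FileSystem(anon=True, client_kwargs={'endpoint_url': 'https://obs.example.org'})
store = s3fs.S3Map(root='some-bucket/some-cube.zarr', s3=s3, check=False)
cube = xr.open_zarr(store)

# only this slice would actually be transferred
subset = cube['soil_moisture'].sel(time=slice('2001-01-01', '2001-12-31'))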

In any case, as of now, for me at least, I think this is a good solution.

Thanks again,
Emmanuel

Hello,

So I attempted to download the data via the new script that was sent to me, but I kept getting the following error:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/emmanuel/.conda/envs/esdc/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/emmanuel/.conda/envs/esdc/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "<ipython-input-78-ab6dfd083c1c>", line 2, in download_cube_file
    zdir_name = cube_file.split('/',1)[1]
AttributeError: 'tuple' object has no attribute 'split'
"""

The above exception was the direct cause of the following exception:

AttributeError                            Traceback (most recent call last)
<ipython-input-79-6c6e5af8a294> in <module>
     11 
     12 pool = multiprocessing.Pool(4)
---> 13 pool.map(download_cube_file, f_list)

~/.conda/envs/esdc/lib/python3.7/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    266         in a list that is returned.
    267         '''
--> 268         return self._map_async(func, iterable, mapstar, chunksize).get()
    269 
    270     def starmap(self, func, iterable, chunksize=None):

~/.conda/envs/esdc/lib/python3.7/multiprocessing/pool.py in get(self, timeout)
    655             return self._value
    656         else:
--> 657             raise self._value
    658 
    659     def _set(self, i, obj):

AttributeError: 'tuple' object has no attribute 'split'

Attempt to Fix - Step I

So I tried to debug it myself, because it was odd that it wasn’t recognizing any of the arguments. I was able to move forward: I created a new version of the script, with comments marking where I made modifications.

def download_cube_file(cube_file):
    
    # get directory name, returns a list of the path and filename
    zdir_name = cube_file[0].split('/',1)
    print(zdir_name)
    
    # check if path exists or not (INCLUDE OUTPUT PATH)
    if not os.path.exists(output_path + zdir_name[0]):
        os.makedirs(output_path + zdir_name[0])
    print(output_path + zdir_name[0])
    
    # Again, cube_file is a tuple.
    f_name = os.path.join(output_path, zdir_name[0], cube_file[0].rsplit('/',1)[1])
    print(f_name)
    s3fs.S3FileSystem.get(s3, cube_file, filename = f_name)

Attempt to Fix - Step II

So I was able to get to the final line of the function. But then I got the following error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-142-f3adf6b39587> in <module>
     14 
     15 for icube_file in f_list:
---> 16     download_cube_file(icube_file)
     17     break

<ipython-input-141-d3812c002e5e> in download_cube_file(cube_file)
      9     f_name = os.path.join(output_path, zdir_name[0], cube_file[0].rsplit('/',1)[1])
     10     print(f_name)
---> 11     s3fs.S3FileSystem.get(s3, cube_file, filename = f_name)

TypeError: get() missing 1 required positional argument: 'lpath'

This is strange and I couldn’t find the positional argument anywhere in the docs (namely here).

Attempt to Fix - Step III

So finally, I changed the function slightly by calling the get method on the s3fs.core.S3FileSystem instance, resulting in the final function:

def download_cube_file(cube_file):
    
    # get directory name, cube_file is a tuple
    zdir_name = cube_file[0]
    print(zdir_name)
    
    # check if path exists or not (INCLUDE OUTPUT PATH)
    if not os.path.exists(output_path + zdir_name.split('/',1)[0]):
        os.makedirs(output_path + zdir_name.split('/',1)[0])
    print(output_path + zdir_name.split('/',1)[0])
    
    # Again, cube_file is a tuple.
    f_name = os.path.join(output_path, zdir_name.split('/',1)[0], cube_file[0].rsplit('/',1)[1])
    print(f_name)
    s3.get(cube_file, f_name)

And again another error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-154-f3adf6b39587> in <module>
     14 
     15 for icube_file in f_list:
---> 16     download_cube_file(icube_file)
     17     break

<ipython-input-152-2fc857257ff4> in download_cube_file(cube_file)
     13     f_name = os.path.join(output_path, zdir_name.split('/',1)[0], cube_file[0].rsplit('/',1)[1])
     14     print(f_name)
---> 15     s3.get(cube_file, f_name)

~/.conda/envs/esdc/lib/python3.7/site-packages/fsspec/spec.py in get(self, rpath, lpath, recursive, **kwargs)
    559         (streaming through local).
    560         """
--> 561         rpath = self._strip_protocol(rpath)
    562         if recursive:
    563             rpaths = self.find(rpath)

~/.conda/envs/esdc/lib/python3.7/site-packages/fsspec/spec.py in _strip_protocol(cls, path)
    157         protos = (cls.protocol,) if isinstance(cls.protocol, str) else cls.protocol
    158         for protocol in protos:
--> 159             path = path.rstrip("/")
    160             if path.startswith(protocol + "://"):
    161                 path = path[len(protocol) + 3 :]

AttributeError: 'tuple' object has no attribute 'rstrip'

At this point I’m lost, as I’m not familiar with these functions. I’m sure it’s an easy fix for someone who is. Would someone be able to look into it, if possible?

Thanks,
Emmanuel

Hi Emmanuel,
Could you tell me which cube you would like to download? Does the error message appear right away, or does it take a while?

I will take a look into it, using the data cube you are trying to download.

Hello,

With the code I sent you, I was trying to download this cube:

name = 'CUBE_V2.0.0_global_time_optimized_0.25deg'

However, I tried it with all of the main cubes (0.25, 0.083, time_optimized, spatially_optimized) and the same thing happens. I just can’t seem to understand the input arguments to the s3fs.S3FileSystem.walk function.

The tuple error message pops up almost immediately. I suspect it doesn’t even get through the first item.

With the updated script I tried (see below), I was at least able to get it to run, but I suspect the input arguments are incorrect, because nothing gets downloaded except the filenames.

def download_cube_file(cube_file):
    
    # get directory name, cube_file is a tuple
    zdir_name = cube_file[0]
    
    # check if path exists or not (INCLUDE OUTPUT PATH)
    if not os.path.exists(output_path + zdir_name.split('/',1)[0]):
        os.makedirs(output_path + zdir_name.split('/',1)[0])
    
    # Again, cube_file is a tuple.
    f_name = os.path.join(output_path, zdir_name.split('/',1)[0], cube_file[0].rsplit('/',1)[1])
    s3.get(cube_file[0], f_name)

dataset_descriptor = _cube_config[name]
path = dataset_descriptor.get('Path')

client_kwargs = {}
if 'Endpoint' in dataset_descriptor:
    client_kwargs['endpoint_url'] = dataset_descriptor['Endpoint']
if 'Region' in dataset_descriptor:
    client_kwargs['region_name'] = dataset_descriptor['Region']
s3 = s3fs.S3FileSystem(anon=True, client_kwargs=client_kwargs)
f_list = (s3fs.S3FileSystem.walk(s3,path))

# pool = multiprocessing.Pool(4)
# pool.map(download_cube_file, f_list)

for icube_file in f_list:
    download_cube_file(icube_file)
    break

Thanks for looking into it.

Emmanuel

Hello Emmanuel,

I think the error is due to a coding mistake on my side. Please use the initial code as on JupyterHub, but replace the download_cube_file function with the following:

def download_cube_file(cube_file):
    zdir_name = cube_file.split('/',1)[1]
    if not os.path.exists(os.path.join(output_path,zdir_name.rsplit('/', 1)[0])):
        os.makedirs(os.path.join(output_path,zdir_name.rsplit('/', 1)[0]))
    f_name = os.path.join(output_path, zdir_name.rsplit('/', 1)[0], cube_file.rsplit('/',1)[1])
    s3fs.S3FileSystem.get(s3, cube_file, filename = f_name)

My output_path looks like this:
output_path = '/home/alicja/test_download_cubes/'

It would be great if you could get back to me on whether this fixed the problem (I forgot to include the output_path in the if part…).

Sorry about the mistake!

Thanks for looking into it. Unfortunately, I still get the same tuple error for the zdir_name parameter in the download_cube_file function.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-65-f3adf6b39587> in <module>
     14 
     15 for icube_file in f_list:
---> 16     download_cube_file(icube_file)
     17     break

<ipython-input-64-9d9e9c7a266e> in download_cube_file(cube_file)
      1 def download_cube_file(cube_file):
----> 2     zdir_name = cube_file.split('/',1)[1]
      3     if not os.path.exists(os.path.join(output_path,zdir_name.rsplit('/', 1)[0])):
      4         os.makedirs(os.path.join(output_path,zdir_name.rsplit('/', 1)[0]))
      5     f_name = os.path.join(output_path, zdir_name.rsplit('/', 1)[0], cube_file.rsplit('/',1)[1])

AttributeError: 'tuple' object has no attribute 'split'

It’s because each item of the f_list that is fed into the download_cube_file function can’t be parsed: it comes out as a tuple (filepath, variable_names, extensions) rather than a string. Maybe I’m using a different version of the s3fs.S3FileSystem.walk() function, and it now yields tuples instead of strings? I’ve listed my package versions below.
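
To illustrate, here is a minimal probe of the first item (same s3 and path as in my snippet above; it only prints the tuple’s shape):

# each item from walk appears to be an os.walk-style tuple: (directory path, subdirectory names, file names)
for dirpath, dirnames, filenames in s3.walk(path):
    print(type(dirpath), dirnames[:3], filenames[:3])
    break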

!python -V
s3fs.__version__
xr.show_versions()
Python 3.7.4
'0.3.5'
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.4 (default, Aug 13 2019, 20:35:49) 
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
libhdf5: 1.10.4
libnetcdf: 4.6.1

xarray: 0.13.0
pandas: 0.25.1
numpy: 1.17.2
scipy: 1.3.1
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: 2.3.2
cftime: 1.0.3.4
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.0.21
cfgrib: None
iris: None
bottleneck: None
dask: 2.4.0
distributed: None
matplotlib: 3.1.1
cartopy: 0.17.0
seaborn: None
numbagg: None
setuptools: 41.2.0
pip: 19.2.3
conda: None
pytest: None
IPython: 7.8.0
sphinx: None

Hi Emmanuel,

You are right, this is a version issue. The code posted in the Jupyter notebook was written for s3fs version 0.2.0.

For version 0.3.5, the following code snippet should do the job:

def download_cube_file(cube_file):

    # s3fs.S3FileSystem.walk(s3, path) yields tuples; cube_file is one of them. The first part of the
    # tuple is the path of a directory in the bucket. If that directory contains subdirectories, they
    # are listed in the second part, and the files it contains are listed in the third part.

    # Take the first element of the tuple and split it at the first '/' to get the directory names
    # within the bucket, without the bucket name.
    zdir_name = cube_file[0].split('/',1)[1]
    if not os.path.exists(os.path.join(output_path,zdir_name)):
        os.makedirs(os.path.join(output_path,zdir_name))

    # The first tuple yielded by walk lists the top-level contents of the data cube (e.g. time, lat, lon)
    # in its second part; once walk descends into the subdirectories, their files appear in the third part.
    # Some tuples do not have three usable parts and only contain None; these are not relevant for the
    # download and are skipped by the try/except below.
    try:
        # i is a file in the directory, so for each file the destination path (dest_path) and the
        # source path (source_path) are assembled and the file is downloaded with s3.get()
        for i in cube_file[2]:
            dest_path = os.path.join(output_path, zdir_name, i)
            source_path = f'{cube_file[0]}/{i}'
            s3.get(source_path, dest_path)
    except:
        pass
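
To drive it, the rest of the setup can stay exactly as in your snippet (same s3, path, and output_path names):

s3 = s3fs.S3FileSystem(anon=True, client_kwargs=client_kwargs)
f_list = s3.walk(path)

# each item is an os.walk-style tuple, consumed by download_cube_file above
for icube_file in f_list:
    download_cube_file(icube_file)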

Fingers crossed, this is the final solution.
Please let me know if this is working for you, and if not, which errors you get 🙂
The download takes quite some time (I think it took about half an hour for the example cube).

Best,
Alicja

Hey!

So the script you provided worked! I was able to successfully download the cube. After testing, I could read it and work with it on our servers without any problems. All is well!

Thank you so much for fixing it! I now understand a bit more about how it works as well. With the updated cube I can resume my work with ease, as the CAB-LAB cubes had a few errors that we had to work around.

Cheers!
Emmanuel