Data Organization

Workspace Files and External Data

One of the first configuration steps in VisData is to choose the following two directories:

  • Workspace Files: A folder that stores all of the internal files for the current workspace. This is represented by the environment variable $WORKSPACE.

  • External Data: A folder that contains data for import. This is represented by the environment variable $EXTERNAL_DATA.

_images/workspace-and-external-data.png

VisData reads from and writes to the workspace files, while the external data location is treated as read-only. The app translates both environment variables to the configured directories automatically.

Why are there two different directories and what should they be?

When working with large amounts of data, it is common to store it on separate drives:

  • A working set of files that are immediately relevant on a fast SSD

  • A collection of unfiltered data on a large, slower disk

VisData supports this common workflow through these two directories. If desired, the two locations can be on the same disk or even in the same directory. If the user only plans to view a collaborator's workspace (and does not need to import results or data), the external data directory can be ignored.

Warning

Avoid setting the directories to the root (/) of your filesystem. This can cause unintended performance issues within Docker.
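The warning above can be enforced with a small check before launching the app. This is a minimal sketch, not part of VisData; the helper name is_filesystem_root is hypothetical and relies only on the Python standard library.

```python
from pathlib import Path


def is_filesystem_root(directory: str) -> bool:
    """Return True if the directory resolves to a filesystem root.

    A root path is the only path that is its own parent.
    Hypothetical helper for illustration; VisData does not provide it.
    """
    resolved = Path(directory).resolve()
    return resolved == resolved.parent


if __name__ == "__main__":
    # Example: check candidate directories before using them.
    for candidate in (Path.cwd().anchor, str(Path.cwd())):
        verdict = "avoid (root)" if is_filesystem_root(candidate) else "ok"
        print(f"{candidate}: {verdict}")
```

A check like this could run on the values intended for $WORKSPACE and $EXTERNAL_DATA before they are handed to Docker.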

Why use environment variables?

The short answer is portability. References to files in the workspace will use this environment variable convention. This allows the user to rename directories or share the workspace with collaborators without having to change any of the workspace configuration files. As long as the relative organization is kept the same, the workspace and external data will be consistent across different environments.

_images/example-external-data-reference.png

Example of using the $EXTERNAL_DATA environment variable when importing new data.
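The portability argument can be illustrated with a short sketch. It assumes a stored reference is a plain string that begins with $WORKSPACE or $EXTERNAL_DATA; how VisData performs the translation internally is not documented here, so the function resolve_reference and the example paths are hypothetical.

```python
from pathlib import PurePosixPath
from string import Template


def resolve_reference(reference: str, env: dict) -> PurePosixPath:
    """Expand $WORKSPACE / $EXTERNAL_DATA in a stored file reference.

    Hypothetical helper: raises KeyError if the reference uses a
    variable that is missing from env.
    """
    return PurePosixPath(Template(reference).substitute(env))


# Two machines with different directory layouts (example values only).
machine_a = {"WORKSPACE": "/mnt/ssd/visdata-ws", "EXTERNAL_DATA": "/mnt/archive/raw"}
machine_b = {"WORKSPACE": "/home/alex/ws", "EXTERNAL_DATA": "/data"}

# The same stored reference resolves correctly on both machines.
reference = "$EXTERNAL_DATA/flight_tests/run_01"

if __name__ == "__main__":
    print(resolve_reference(reference, machine_a))  # /mnt/archive/raw/flight_tests/run_01
    print(resolve_reference(reference, machine_b))  # /data/flight_tests/run_01
```

Because only the variable is stored, the configuration file itself does not change when the directories are renamed or moved.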

Multiple Workspaces

VisData is designed to make switching context easy through the use of multiple workspaces.

What is the advantage of having multiple workspaces?

While VisData works fine with a single workspace, it may be convenient to split up work by project or goal. For example, the user may be working on one trade study that examines detection performance on small objects and another that analyzes the impact of noise. When presenting results, the user can launch the app with the desired workspace to avoid showing unnecessary data or distracting the audience.

Terminology

_images/Data-Primitives.svg

_images/Data-Organization.svg

Definitions

Each dataset is organized under the following hierarchy:

  • workspace: The directory on disk that has all the files for the current project

  • suite: A top-level name that unifies all groups beneath it

  • group: A mid-level name that unifies all datasets beneath it

  • name: The name of a specific dataset

Every dataset must have a unique suite/group/name combination. This acts as a convenient, human-readable identifier for a dataset. Similar datasets can and should share an identical group and name; in that case they must differ in suite so that the full combination stays unique.
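The uniqueness rule can be sketched as follows. The DatasetId class and find_duplicates function are hypothetical names used for illustration, and the example suite, group, and dataset names are made up; none of this is taken from the VisData API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetId:
    """Human-readable dataset identifier (hypothetical type)."""
    suite: str
    group: str
    name: str

    def __str__(self) -> str:
        return f"{self.suite}/{self.group}/{self.name}"


def find_duplicates(ids):
    """Return the identifiers that occur more than once, as sorted strings."""
    counts = Counter(ids)
    return sorted(str(dataset_id) for dataset_id, n in counts.items() if n > 1)


if __name__ == "__main__":
    ids = [
        # Same group and name under different suites: allowed.
        DatasetId("baseline", "small_objects", "run_01"),
        DatasetId("noisy", "small_objects", "run_01"),
        # Exact repeat of the first identifier: not allowed.
        DatasetId("baseline", "small_objects", "run_01"),
    ]
    print(find_duplicates(ids))  # ['baseline/small_objects/run_01']
```

A frozen dataclass is hashable, which is what lets the identifiers be counted directly.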

Anatomy of a Workspace

{WORKSPACE}
├── frames                     # Image files. Can be added to large file revision control (e.g. dvc, git lfs)
│   ├── {SUITE_1}
│   │   └── {GROUP_A}
│   │       └── {NAME_α}/*.jpg
├── repo                       # JSON data. Can be committed to git or large file revision control depending on scale
│   ├── frame_meta             # FRAME-level data
│   │   └── {SUITE_1}
│   │       └── {GROUP_A}
│   │           └── {NAME_α}.json
│   ├── dataset_meta           # DATASET-level data
│   │   └── {SUITE_1}
│   │       └── {GROUP_A}
│   │           └── {NAME_α}.json
│   └── settings               # Defines workspace configuration such as data links (data import configuration)
│       └── ...
└── results                    # Algorithm results
    └── {SUITE_1}
        └── {GROUP_A}
            └── {NAME_α}
                └── {RESULT_NAME_a}.json
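The layout above maps a suite/group/name identifier to four locations on disk. A minimal sketch of that mapping follows, assuming the directory names shown in the tree; the function dataset_paths is hypothetical and the example values are made up.

```python
from pathlib import PurePosixPath


def dataset_paths(workspace: str, suite: str, group: str, name: str) -> dict:
    """Build the expected locations of one dataset inside a workspace.

    Hypothetical helper that mirrors the tree above; it only builds
    paths and does not touch the filesystem.
    """
    root = PurePosixPath(workspace)
    rel = PurePosixPath(suite) / group
    return {
        "frames": root / "frames" / rel / name,  # directory of image files
        "frame_meta": root / "repo" / "frame_meta" / rel / f"{name}.json",
        "dataset_meta": root / "repo" / "dataset_meta" / rel / f"{name}.json",
        "results": root / "results" / rel / name,  # directory of result JSON files
    }


if __name__ == "__main__":
    paths = dataset_paths("/ws", "suite_1", "group_a", "alpha")
    for kind, path in paths.items():
        print(f"{kind:13s} {path}")
```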

Generic Example

{WORKSPACE}
├── frames
│   ├── {SUITE_1}
│   │   └── {GROUP_A}
│   │       └── {NAME_α}/*.jpg
│   └── {SUITE_2}
│       ├── {GROUP_A}
│       │   ├── {NAME_α}/*.jpg
│       │   ├── {NAME_β}/*.jpg
│       │   └── {NAME_γ}/*.jpg
│       └── {GROUP_B}
│           ├── {NAME_α}/*.jpg
│           └── {NAME_β}/*.jpg
├── repo
│   ├── frame_meta
│   │   ├── {SUITE_1}
│   │   │   └── {GROUP_A}
│   │   │       └── {NAME_α}.json
│   │   └── {SUITE_2}
│   │       ├── {GROUP_A}
│   │       │   ├── {NAME_α}.json
│   │       │   ├── {NAME_β}.json
│   │       │   └── {NAME_γ}.json
│   │       └── {GROUP_B}
│   │           ├── {NAME_α}.json
│   │           └── {NAME_β}.json
│   ├── dataset_meta
│   │   ├── {SUITE_1}
│   │   │   └── {GROUP_A}
│   │   │       └── {NAME_α}.json
│   │   └── {SUITE_2}
│   │       ├── {GROUP_A}
│   │       │   ├── {NAME_α}.json
│   │       │   ├── {NAME_β}.json
│   │       │   └── {NAME_γ}.json
│   │       └── {GROUP_B}
│   │           ├── {NAME_α}.json
│   │           └── {NAME_β}.json
│   └── settings
│       └── ...
└── results
    └── {SUITE_1}
        └── {GROUP_A}
            ├── {NAME_α}
            │   ├── {RESULT_NAME_a}.json
            │   └── {RESULT_NAME_b}.json
            └── {NAME_β}
                ├── {RESULT_NAME_a}.json
                ├── {RESULT_NAME_b}.json
                └── {RESULT_NAME_c}.json
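Given a workspace laid out like the example above, the datasets and their results can be enumerated by walking the tree. This is a sketch under the assumption that every dataset has a file under repo/dataset_meta; list_datasets and list_results are hypothetical helpers, not VisData functions.

```python
from pathlib import Path


def list_datasets(workspace):
    """Return sorted (suite, group, name) tuples found under repo/dataset_meta."""
    meta_root = Path(workspace) / "repo" / "dataset_meta"
    found = []
    # Layout assumed from the tree above: {SUITE}/{GROUP}/{NAME}.json
    for meta_file in meta_root.glob("*/*/*.json"):
        suite, group = meta_file.parts[-3], meta_file.parts[-2]
        found.append((suite, group, meta_file.stem))
    return sorted(found)


def list_results(workspace, suite, group, name):
    """Return sorted result names for one dataset, or [] if it has none."""
    result_dir = Path(workspace) / "results" / suite / group / name
    if not result_dir.is_dir():
        return []
    return sorted(p.stem for p in result_dir.glob("*.json"))
```

In the generic example, a call for {SUITE_1}/{GROUP_A}/{NAME_α} would list two results, while every dataset under {SUITE_2} would return an empty list because no results directory exists for that suite.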