CATVISH GUIDE
Data Management
Data quality is one of the most important factors in model performance. Catvish provides tools for importing, cleaning, and versioning your image datasets.
Importing Datasets
Navigate to the Data tab in the sidebar. To import data:
- Click the "Add Data" button in the top right.
- Drag and drop a folder containing images (JPG, PNG) into the drop zone. Catvish supports recursive directory scanning.
- (Optional) Upload an existing annotation file (COCO JSON or YOLO TXT) to pre-populate labels.
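The two annotation formats store bounding boxes differently, which matters when checking that imported labels look right. In the standard conventions, a YOLO TXT line holds a class index followed by a box center and size normalized to the image dimensions, while a COCO JSON "bbox" holds the top-left corner and size in pixels. The following is a minimal sketch of that relationship; the function name is hypothetical, and how Catvish parses these files internally is not documented here.

```python
def yolo_to_coco_bbox(line: str, img_w: int, img_h: int) -> tuple[int, list[float]]:
    """Convert one YOLO TXT line to (class_id, COCO-style [x, y, width, height]).

    YOLO line:  "<class_id> <x_center> <y_center> <width> <height>", all normalized to 0..1.
    COCO bbox:  [x_top_left, y_top_left, width, height] in pixels.
    """
    class_id, xc, yc, w, h = line.split()
    # Scale the normalized values up to pixel units.
    xc_px, w_px = float(xc) * img_w, float(w) * img_w
    yc_px, h_px = float(yc) * img_h, float(h) * img_h
    # Move from box center to top-left corner.
    return int(class_id), [xc_px - w_px / 2, yc_px - h_px / 2, w_px, h_px]


class_id, bbox = yolo_to_coco_bbox("0 0.5 0.5 0.25 0.5", img_w=640, img_h=480)
```

Note that a YOLO file needs the image dimensions to be interpreted in pixels, whereas a COCO file records them alongside each image entry.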
Duplicate Detection: During import, Catvish calculates the SHA-256 hash of every image. Exact duplicates are automatically flagged and can be removed with one click.
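Hash-based duplicate detection can be sketched in a few lines. The example below is an illustration only, not Catvish's implementation: the function names and the list of file suffixes are assumptions, and it hashes raw file bytes, so only byte-identical files are grouped together.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Assumed suffix list for illustration; the guide names JPG and PNG.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: Path) -> dict[str, list[Path]]:
    """Scan root recursively and group image files that share a digest."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            by_hash[sha256_of_file(path)].append(path)
    # Keep only digests that occur more than once.
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Each entry in the returned dictionary is one group of identical files; keeping the first path in a group and removing the rest leaves one copy of each image.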
Data Versioning
Catvish treats datasets like code. Instead of keeping separate folders such as "dataset_v1" and "dataset_v2", Catvish uses a commit-based versioning system.
How it Works
- Working Tree: This is your current, editable state. You can add images and change labels here freely.
- Committing: When you are happy with the state of the data, create a "Version". This freezes the dataset state (images + labels).
- Checkout: You can switch back to any previous version instantly. This is crucial for reproducing training results.
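The working tree, commit, and checkout steps above can be modeled with a toy in-memory class. This is a conceptual sketch under stated assumptions, not the Catvish API: the class name, method names, and the choice to derive a version ID from a hash of the snapshot are all hypothetical.

```python
import copy
import hashlib
import json


class DatasetRepo:
    """Toy model: one editable working tree plus frozen, named snapshots."""

    def __init__(self) -> None:
        # Working tree: image name -> list of labels (editable at any time).
        self.working_tree: dict[str, list[str]] = {}
        self._versions: dict[str, dict[str, list[str]]] = {}

    def commit(self) -> str:
        """Freeze the current images and labels; return a version ID."""
        snapshot = copy.deepcopy(self.working_tree)
        # Assumption: the ID is derived from the snapshot content.
        payload = json.dumps(snapshot, sort_keys=True).encode("utf-8")
        version_id = hashlib.sha256(payload).hexdigest()[:12]
        self._versions[version_id] = snapshot
        return version_id

    def checkout(self, version_id: str) -> None:
        """Replace the working tree with a copy of a stored version."""
        self.working_tree = copy.deepcopy(self._versions[version_id])


repo = DatasetRepo()
repo.working_tree["cat_001.jpg"] = ["cat"]
v1 = repo.commit()

# Keep editing after the commit; the stored version is unaffected.
repo.working_tree["cat_001.jpg"].append("indoor")
repo.working_tree["dog_001.jpg"] = ["dog"]
v2 = repo.commit()

# Return to the exact state used for an earlier training run.
repo.checkout(v1)
```

The deep copies are the important detail: a version that shared objects with the working tree would change whenever the working tree did, which would defeat the purpose of freezing it.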
Split Configuration
Before training, data must be split into three subsets. The default ratios are:
- Train (70%): Used by the model to learn.
- Validation (20%): Used during training to tune hyperparameters and check for overfitting.
- Test (10%): Held out completely to evaluate the final model performance.
You can adjust these ratios in the Versions tab. The split is deterministic: each image is assigned to a subset based on its hash, so the same data produces the same subsets on different machines.
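One common way to build a deterministic, hash-based split is to map each image's hash to a bucket number and compare it against cumulative thresholds. The sketch below assumes SHA-256 over the image bytes and 10,000 buckets; the function name and these details are illustrative assumptions, not a description of Catvish's exact scheme.

```python
import hashlib

NUM_BUCKETS = 10_000  # assumed resolution: ratios are honored to within 0.01%


def assign_split(image_bytes: bytes, train: float = 0.70, val: float = 0.20) -> str:
    """Assign an image to "train", "val", or "test" from its content hash alone."""
    digest = hashlib.sha256(image_bytes).digest()
    # Interpret the first 8 bytes as an integer and reduce it to a bucket.
    bucket = int.from_bytes(digest[:8], "big") % NUM_BUCKETS
    train_cutoff = round(train * NUM_BUCKETS)
    val_cutoff = round((train + val) * NUM_BUCKETS)
    if bucket < train_cutoff:
        return "train"
    if bucket < val_cutoff:
        return "val"
    return "test"


# The same bytes always land in the same subset, on any machine.
first = assign_split(b"example image bytes")
second = assign_split(b"example image bytes")
```

Because the assignment depends only on the image content and the ratios, no random seed or file ordering needs to be shared between machines. With this threshold approach, changing a ratio only moves the images whose buckets fall between the old and new cutoffs. The resulting proportions are approximate and get closer to the configured ratios as the dataset grows.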