Allow user to specify start/end rows of data to be processed for training/validation/test | Features Requests

PerceptiLabs

Only the data between the specified start and end rows would be split into training/validation/test.

All other rows would remain unused.
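
A minimal sketch of the requested behaviour, in plain Python rather than PerceptiLabs' actual data pipeline. The function name `split_row_range`, the 0-based inclusive-start/exclusive-end convention, the default 70/20/10 fractions and the shuffling with a fixed seed are all assumptions made for illustration:

```python
import random


def split_row_range(rows, start, end, fractions=(0.7, 0.2, 0.1), seed=0):
    """Split only rows[start:end] into training/validation/test.

    Hypothetical helper: `start` is inclusive and `end` exclusive (0-based).
    Rows outside the range are simply never touched.
    """
    if not 0 <= start <= end <= len(rows):
        raise ValueError("row range out of bounds")
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    selected = list(rows[start:end])        # copy, so the input is not mutated
    random.Random(seed).shuffle(selected)   # reproducible shuffle

    n_train = round(len(selected) * fractions[0])
    n_val = round(len(selected) * fractions[1])
    return {
        "training": selected[:n_train],
        "validation": selected[n_train:n_train + n_val],
        "test": selected[n_train + n_val:],  # whatever is left over
    }
```

For example, `split_row_range(list(range(100)), 10, 60)` would distribute only rows 10 to 59 across the three sets and leave the other 50 rows unused.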

Created by Robert Lundberg

November 14, 2021

Julian Moore

In addition to that option, it would also be very useful and flexible if each row could have a train/validation/test category, with a selector option (like "numeric", "target") called "Model Phase". If used (whether to use or ignore it should be specified separately), this would allow specific rows to be assigned to each purpose.
(The user should also be able to select row ranges quickly and simply, to test model functionality with a limited dataset before advancing to using the model.)
Motivation: ImageNet and ImageNet 100 store their images in folders; counting them up and then trying to a) determine and b) document & apply specific split percentages is too much work for such datasets to be easily usable.
It should be possible to modify the dataset specification quickly, so that minimal re-processing is required.

Robert Lundberg

Julian Moore: Good comments, thanks! :)
Just for clarification, do I understand you correctly that you would want to be able to add a single column to the CSV which specifies if a row goes into training/validation/testing?
I lost you a bit on the "numeric" selector. Unless you mean to use a 2nd column in the CSV to define which batch each row should go in?

Julian Moore

Robert Lundberg: Yes, the user may include a column that indicates whether each data row is to be used for training, validation or test. To avoid needing a specific column label, I suggested that the "Select" drop-down [I forgot there were two; not the image/numeric drop-down :) ], which currently lets the user choose one of "Do not use", "Input", "Target" etc., should include the option "Model Phase" to indicate the intended use of that column (it can of course be ignored, in which case the % controls would be effective). Any unambiguous, case-insensitive string should be usable as a training/validation/test indicator, i.e. all of the following would be valid: {tr, v, va, te, test}, and these would be invalid: {t, validating, X, true, False}.

If the import goes via a dataframe, partitioning according to these values is trivial once you have a function that determines Train/Val/Test from the provided strings.
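
One possible reading of the "unambiguous, case-insensitive" rule is prefix matching: a label is accepted only if it is a prefix of exactly one phase name. The sketch below assumes that reading. The names `PHASES`, `parse_phase` and `partition_rows` are hypothetical, as is the choice of "training", "validation" and "testing" as canonical names. It uses plain lists of dicts to stay dependency-free; with a pandas dataframe the same function could presumably be applied to the column via `Series.map` and the result used for grouping.

```python
# Assumed canonical phase names; a label must be a prefix of exactly one.
PHASES = ("training", "validation", "testing")


def parse_phase(label):
    """Return the phase that `label` unambiguously abbreviates, else None."""
    key = str(label).strip().lower()
    if not key:
        return None
    matches = [phase for phase in PHASES if phase.startswith(key)]
    return matches[0] if len(matches) == 1 else None


def partition_rows(rows, phase_column):
    """Group rows (dicts) by the phase named in `phase_column`.

    Rows whose label is ambiguous or unrecognised are returned separately
    instead of being silently assigned to a phase.
    """
    parts = {phase: [] for phase in PHASES}
    rejected = []
    for row in rows:
        phase = parse_phase(row[phase_column])
        if phase is None:
            rejected.append(row)
        else:
            parts[phase].append(row)
    return parts, rejected
```

Under this rule "tr", "v", "va", "te" and "test" each resolve to a single phase, while "t" (matches both training and testing), "validating", "X", "true" and "False" are rejected, which agrees with the valid and invalid sets listed above.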

Robert Lundberg

Julian Moore: That makes sense, thanks for clarifying! :)