Finalizing Synthetic Data Creation

Once the input type and model configuration are complete, users can refine the synthetic dataset by selecting specific columns, applying filters, and reviewing the final summary before generating the data.

Steps to Finalize and Generate the Dataset

1. Final Summary

The Final Summary tab gives an overview of the entire configuration for generating the synthetic dataset. Users can review the following sections:

  1. Model Configuration:

    • Name: The name assigned to the dataset.
    • Modeling Type: The modeling approach (e.g., LLM - Large Language Model).
    • Selected Model: The model chosen for the task, such as GPT-4.
    • Prompt: The custom prompt, if one was provided, used to guide the data generation.

    This section lets the user confirm the model details and generation settings before proceeding.

  2. Data Configuration:

    • Input Options: Specifies whether the input was a file or a database.
    • File Type: If file input was chosen, it shows whether the input was JSON, Text, etc.
    • Generated Sample Size: The number of data samples to generate (e.g., 10 data points).
    • Temperature and Top P: Sampling parameters that control the randomness (creativity) and diversity of the generated data.
    • Imputed Columns: Any columns in which missing values will be imputed.
    • Rebalancing Options: Whether the dataset is rebalanced based on certain conditions or fairness-sensitive columns.

    These configuration settings provide detailed control over how the dataset is generated, allowing the user to fine-tune the synthetic data output.
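
To make the Temperature and Top P settings more concrete, the sketch below shows how these two parameters are commonly applied when a language model samples its next token. It is a generic illustration, not this tool's implementation: the function name `apply_temperature_and_top_p` and the toy logits are assumptions made for the example.

```python
import math


def apply_temperature_and_top_p(logits, temperature=1.0, top_p=1.0):
    """Return {token_index: probability} after temperature scaling and top-p filtering.

    Hypothetical helper for illustration only; it is not part of the tool.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")

    # Temperature scaling: lower values sharpen the distribution,
    # higher values flatten it (more varied output).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p (nucleus) filtering: keep the smallest set of most likely
    # tokens whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
    print(apply_temperature_and_top_p(toy_logits, temperature=0.5))
    print(apply_temperature_and_top_p(toy_logits, temperature=2.0))
    print(apply_temperature_and_top_p(toy_logits, temperature=1.0, top_p=0.5))
```

In this sketch, a lower temperature concentrates probability on the most likely token, while a lower Top P discards unlikely tokens entirely. Both reduce variety in the generated samples.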


2. Data Output Options

After reviewing the final summary, the user can select how to export the generated synthetic dataset:

  1. Download as a File:

    • Allows the user to export the dataset in formats such as JSON, CSV, or any other specified file type.
  2. Store in Database:

    • Alternatively, the dataset can be stored directly in a connected database for further analysis.

    These flexible output options ensure that the generated data can be used effectively, whether for immediate download or integration with an existing database system.
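
For readers who post-process the generated data themselves, the two output paths can be sketched with the Python standard library. This is only an illustration under stated assumptions: the function names, the `synthetic_data` table name, and the use of SQLite as the "connected database" are all hypothetical and do not describe how the tool performs its export.

```python
import csv
import json
import sqlite3


def export_json(records, path):
    """Write a list of dict records to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)


def export_csv(records, path):
    """Write a list of dict records to a CSV file with a header row."""
    fieldnames = list(records[0].keys()) if records else []
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)


def store_in_sqlite(records, db_path, table="synthetic_data"):
    """Insert records into a SQLite table (assumed stand-in for a connected database).

    Returns the number of rows inserted. Column names come from the first record.
    """
    if not records:
        return 0
    columns = list(records[0].keys())
    column_list = ", ".join('"' + c + '"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                'CREATE TABLE IF NOT EXISTS "' + table + '" (' + column_list + ")"
            )
            conn.executemany(
                'INSERT INTO "' + table + '" (' + column_list + ") VALUES ("
                + placeholders + ")",
                [tuple(r[c] for c in columns) for r in records],
            )
    finally:
        conn.close()
    return len(records)
```

A typical call would be `export_json(records, "synthetic.json")` or `store_in_sqlite(records, "synthetic.db")`, where `records` is a list of dictionaries with identical keys.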


Example Output Configuration:

The images show an example of the output configuration:

  • Download Format: JSON
  • Type: enum (customizable based on the output type)
  • Generated Data Points: 10

Users can then click the Start Generations button to begin generating the synthetic data based on the configured settings.
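
As a rough way to think about the example above, the output settings can be modeled as a small validated record. The sketch below is hypothetical: the `OutputConfig` class, its field names, and the `ALLOWED_FORMATS` set are assumptions for illustration, and the "Type" field is left out because its meaning is not specified here.

```python
from dataclasses import asdict, dataclass

# Assumption: only the two formats named explicitly in this guide.
ALLOWED_FORMATS = {"JSON", "CSV"}


@dataclass(frozen=True)
class OutputConfig:
    """Hypothetical record mirroring the example output configuration."""

    download_format: str = "JSON"
    generated_data_points: int = 10

    def __post_init__(self):
        # Reject settings that could not produce a usable export.
        if self.download_format.upper() not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {self.download_format}")
        if self.generated_data_points <= 0:
            raise ValueError("generated_data_points must be positive")


def summarize(config):
    """Render the configuration as 'Label: value' lines, like a summary tab."""
    lines = []
    for key, value in asdict(config).items():
        label = key.replace("_", " ").title()
        lines.append(f"{label}: {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(summarize(OutputConfig()))
```

Validating at construction time means an invalid combination, such as zero data points, is caught before any generation is started.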


Conclusion

The final summary screen lets users review all settings before data generation starts. Because both the input configuration and the output format are adjustable, the generated dataset can be tailored to the specific needs of the project. The dataset, produced by models such as GPT-4, can then be downloaded or stored for future use.

This overview, combined with the final UI configuration in the images, helps to ensure that the synthetic data creation process is seamless and flexible.