How to Generate 1TB of Synthetic Data Faster Than a Coffee Break
And Cheaper Than Your Starbucks Coffee
Imagine creating a massive 1-terabyte dataset of IoT data in less time than it takes to enjoy your coffee break. With synthetic data generation techniques and a modest amount of computing power, this becomes a reality. By leveraging a 4-core machine, we can generate an astounding 1 million rows per second, with each row containing 1 KB of data. Let's break down what this means:
1 billion rows of 1 KB each equates to 1000 GB or 1 TB of data.
At a rate of 1 million rows per second, it takes approximately 1000 seconds (about 16.66 minutes) to generate 1 billion rows.
On the 4-core machine used in this example, that works out to roughly 17 minutes for a full terabyte of synthetic IoT data; doubling to an 8-core machine would bring it under 10 minutes. Such rapid data generation opens up exciting possibilities for developers, data scientists, and researchers working on big data projects, IoT applications, or machine learning models that require extensive datasets for training and testing.
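If you'd like to see the arithmetic spelled out, here's a quick back-of-the-envelope check in plain Python:
ROW_SIZE_BYTES = 1_000          # ~1 KB per row
ROWS_PER_SECOND = 1_000_000     # 1 million rows per second
TOTAL_ROWS = 1_000_000_000      # 1 billion rows

total_gb = TOTAL_ROWS * ROW_SIZE_BYTES / 1e9        # 1000.0 GB, i.e. ~1 TB
total_minutes = TOTAL_ROWS / ROWS_PER_SECOND / 60   # ~16.7 minutes
print(f"{total_gb:.0f} GB generated in {total_minutes:.1f} minutes")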
Why create synthetic datasets?
Privacy and Compliance: Synthetic data allows developers to work with realistic data without risking exposure of sensitive information, helping to meet data protection regulations.
Scalability and Control: You can generate virtually unlimited amounts of data with precise control over its characteristics, enabling thorough testing of systems at scale and creation of edge cases that might be rare or impossible to capture in real-world data.
Development Acceleration: By removing dependency on upstream teams for data, developers can build end-to-end pipelines, set up DevOps processes, and address architectural concerns before actual data becomes available, significantly speeding up the development process.
Cost-Effectiveness and Efficiency: Generating synthetic data is often faster and more economical than collecting and processing real-world data, especially for large-scale testing and development.
Hardware
We used a single machine with 4 cores and 32 GB of memory.
Let’s get into the code
Install Databricks Data Generator
dbldatagen is a Python library for generating synthetic data within the Databricks environment using Spark. The generated data can be used for testing, benchmarking, demos, and many other purposes.
It operates by defining a data generation specification in code that controls how the synthetic data is generated. The specification may incorporate existing schemas or create data in an ad-hoc fashion. You can also use it from Scala, R, or other languages by defining a view over the generated data.
%pip install dbldatagen
Setup and Imports
import dbldatagen as dg
import uuid
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType, IntegerType
from pyspark.sql.functions import expr
Parameters
# Parameterize partitions and rows per second
PARTITIONS = 4 # Match with number of cores on your cluster
ROWS_PER_SECOND = 1 * 1000 * 1000 # 1 Million rows per second
Schema Definition
iot_data_schema = StructType([
StructField("device_id", StringType(), False),
StructField("event_timestamp", TimestampType(), False),
StructField("temperature", DoubleType(), False),
StructField("humidity", DoubleType(), False),
StructField("pressure", DoubleType(), False),
StructField("battery_level", IntegerType(), False),
StructField("device_type", StringType(), False),
StructField("error_code", IntegerType(), True),
StructField("signal_strength", IntegerType(), False)
])
Here, we define the schema for our IoT data. Each StructField represents a column in our dataset, specifying the name, data type, and whether it can contain null values. This schema mimics real-world IoT device data, including device identifiers, sensor readings, and status information.
Why use Databricks Data Generator (dbldatagen)?
Using dbldatagen for synthetic data generation offers several significant benefits: the generated data can align closely with the characteristics of your actual data. The ability to specify parameters like minValue, maxValue, random, and percentNulls allows you to create datasets that closely mimic real-world scenarios. This means you can generate realistic variations in your data, such as different temperature ranges or device IDs, while also controlling for missing values. By tailoring these specifications, you ensure that the synthetic data is not only large in volume but also rich in diversity, making it a valuable resource for testing and training machine learning models effectively.
dataspec = (
dg.DataGenerator(spark, name="iot_data", partitions=PARTITIONS)
.withSchema(iot_data_schema)
.withColumnSpec("device_id", percentNulls=0.1, minValue=1000, maxValue=9999, prefix="DEV_", random=True)
.withColumnSpec("event_timestamp", begin="2023-01-01 00:00:00", end="2023-12-31 23:59:59", random=True)
.withColumnSpec("temperature", minValue=-10.0, maxValue=40.0, random=True)
.withColumnSpec("humidity", minValue=0.0, maxValue=100.0, random=True)
.withColumnSpec("pressure", minValue=900.0, maxValue=1100.0, random=True)
.withColumnSpec("battery_level", minValue=0, maxValue=100, random=True)
.withColumnSpec("device_type", values=["Sensor", "Actuator", "Gateway", "Controller"], random=True)
.withColumnSpec("error_code", minValue=0, maxValue=999, random=True, percentNulls=0.2)
.withColumnSpec("signal_strength", minValue=-100, maxValue=0, random=True)
)
This section creates a data generator specification using dbldatagen. For each column, we define the data generation rules, including value ranges, randomness, and special formatting (like the "DEV_" prefix for device IDs). This ensures our synthetic data closely resembles real IoT data patterns.
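Before switching to streaming, you can optionally sanity-check the spec with a small batch build. This is an illustrative extra step, not part of the original pipeline, and iot_data_sample is just an example view name; registering a view like this is also what makes the generated data usable from SQL, Scala, or R.
# Optional sanity check: build a batch DataFrame from the same spec
# (row count falls back to the generator's default since rows= was not set).
sample_df = dataspec.build()
sample_df.limit(5).show(truncate=False)

# Expose the generated data as a view so it can be queried from SQL, Scala, or R.
sample_df.createOrReplaceTempView("iot_data_sample")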
Streaming DataFrame Creation
streaming_df = (
    dataspec.build(
        withStreaming=True,
        options={
            'rowsPerSecond': ROWS_PER_SECOND,
        }
    )
    .withColumn(
        "firmware_version",
        expr(
            "concat('v', cast(floor(rand() * 10) as string), '.', "
            "cast(floor(rand() * 10) as string), '.', "
            "cast(floor(rand() * 10) as string))"
        )
    )
    .withColumn(
        "location",
        expr(
            "concat(cast(rand() * 180 - 90 as decimal(8,6)), ',', "
            "cast(rand() * 360 - 180 as decimal(9,6)))"
        )
    )
    .withColumn(
        "data_payload",
        expr("repeat(uuid(), 22)")  # Add approx. 800 bytes to construct a ~1 KB row
    )
)
Here, we build the streaming DataFrame using our data specification. We enable streaming with `withStreaming=True` and set the rows per second. We also add three additional columns:
firmware_version: A randomly generated version number.
location: Random latitude and longitude coordinates.
data_payload: A large string to reach our 1 KB-per-row target (see the quick size check below).
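As referenced in the list above, here's a quick, optional size check (illustrative only, not part of the original pipeline) showing how the repeated UUID pads each row:
# A canonical UUID string is 36 characters, so repeating it 22 times yields
# 36 * 22 = 792 bytes; together with the other columns, a row lands near 1 KB.
spark.sql("SELECT length(repeat(uuid(), 22)) AS payload_bytes").show()  # 792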
Data Writing
(
    streaming_df.writeStream
    .queryName("iot_data_stream")
    .outputMode("append")
    .option("checkpointLocation", f"/tmp/dbldatagen/streamingDemo/checkpoint-{uuid.uuid4()}")
    .toTable("soni.default.iot_data_1kb_rows")
)
Finally, we initiate the streaming process. The data is written to a Delta table named "iot_data_1kb_rows" in append mode. A checkpoint location is specified to allow for fault-tolerant execution of the streaming query.
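If you want to watch the stream from the same notebook, here is a minimal monitoring sketch using the standard Structured Streaming query APIs. It assumes the query name set above and is not part of the original pipeline:
import time

# Look up the running query by the name set in writeStream above.
query = next(q for q in spark.streams.active if q.name == "iot_data_stream")

# Sample progress a few times; numInputRows reports rows per micro-batch.
for _ in range(5):
    progress = query.lastProgress
    if progress:
        print(progress["numInputRows"], "rows in the last micro-batch")
    time.sleep(30)

# Stop manually once roughly 1 billion rows have landed:
# query.stop()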
Is it really cheaper than Starbucks coffee?
On a 4-core setup, generating a billion rows of synthetic IoT data would take approximately 17 minutes. Adding an extra 5 minutes as a buffer for instance setup brings the total time to 22 minutes.
Cost Breakdown for Generating 1 Billion Rows of Synthetic IoT Data:
Total time: 17 minutes (data generation) + 5 minutes (instance setup) = 22 minutes
Time in hours: 22 ÷ 60 ≈ 0.3667 hours
Cost (EC2 + Databricks): $0.228
This means generating a terabyte of synthetic IoT data costs just $0.228, less than virtually anything on the Starbucks menu. Such efficiency showcases the cost-effectiveness of synthetic data generation, enabling developers and data scientists to create large-scale datasets for testing and development at a fraction of the cost of traditional methods.
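For transparency, here's the arithmetic behind that figure; the roughly $0.62/hour combined rate is inferred from the quoted total, not taken from any price list:
generation_minutes = 17          # ~1 billion rows at 1M rows/sec on 4 cores
setup_buffer_minutes = 5         # instance startup buffer
total_hours = (generation_minutes + setup_buffer_minutes) / 60   # ≈ 0.3667 hours

quoted_cost = 0.228                              # EC2 + Databricks, from above
implied_hourly_rate = quoted_cost / total_hours  # ≈ $0.62/hour (inferred)
print(f"{total_hours:.4f} h at ~${implied_hourly_rate:.2f}/h -> ${quoted_cost}")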
Furthermore, as illustrated in the graph below, the CPU utilization consistently exceeds 80%, highlighting the system's optimized performance and contributing to the remarkably low cost.
Stay Connected & Keep Learning: Join Our Community
If you found this post helpful, please drop a like to keep me motivated! And feel free to leave a comment below if you have any questions or thoughts—I'd love to hear from you!
If this is your first time reading my content, Welcome! I write in-depth technical blogs on Spark, Databricks, and Spark Streaming. Beyond writing, I specialize in helping data professionals unlock their full potential and ace their next data interviews.
Here are a few ways you can continue to learn, connect, and grow:
Join Us on WhatsApp: Stay updated and engage with the community through our WhatsApp group. Join here.
Join Our Discord Community: Connect with past clients and other data enthusiasts on our Discord server. It’s a great place to network, pair up with peers, and accelerate each other’s journeys. Join here.
Visit My Website: My website is your go-to resource for content, including blogs, tutorials, and more. Check it out here.