We live in a world of data: There’s more of it than ever before, in a ceaselessly expanding array of forms and locations. Dealing with Data is your window into the ways data teams are tackling the challenges of this new world to help their companies and their customers thrive.
Streaming data analytics is expected to grow into a $38.6 billion market by 2025. The world is moving faster than ever, and companies processing large amounts of rapidly changing or growing data need to evolve to keep up — especially with the growth of Internet of Things (IoT) devices all around us. Real-time insights provide businesses with unique advantages that are rapidly becoming must-haves for staying competitive in a variety of markets. Let’s look at a few ways that different industries take advantage of streaming data.
How industries can benefit from streaming data
- Sales, marketing, ad tech: Making faster marketing decisions, optimizing ad spend
- Security, fraud detection: Reducing the time needed to detect and respond to malicious threats
- Manufacturing, supply chain management: Increasing inventory accuracy, avoiding production line downtime, monitoring current IoT and machine-to-machine data
- Energy, utilities: Analyzing IoT data to alert and address equipment issues, get ahead of potential issues, help reduce maintenance costs
- Finance, fintech: Tracking customer behavior, analyzing account activities, responding to fraud and customer needs immediately and proactively while they’re engaged.
- Automotive: Monitoring connected, autonomous cars in real time to optimize routes to avoid traffic and for diagnosis of mechanical issues
As real-time analytics and machine learning stream processing are growing rapidly, they introduce a new set of technological and conceptual challenges. In this piece, we’ll dig into those challenges and how Upsolver and Sisense are helping tackle them.
Technological challenges to handling streaming data
One of the main challenges when dealing with streaming data comes from performing stateful transformations for individual events. Unlike a batch processing job that runs within an isolated batch with clear start and end times, a stream processing job runs continuously on each event separately. Operations like API calls, joins, and aggregations that used to run every few minutes/hours now need to run many times per second. Dealing with this challenge requires caching the relevant context on the processing instances (state management) using techniques like sliding time windows.
Optimizing object storage
Another goal that teams dealing with streaming data may have is managing and optimizing a file system on object storage. Streaming data tends to be very high-volume and therefore expensive to store. However, cloud object storage, like Amazon S3, is a very cost-effective solution (starting at $23 per month per terabyte for hot storage at time of writing) compared to traditional databases and Kafka (which creates three replicas by default on local storage — that’s a lot of data being stored!).
The challenge with object storage is the complexity of optimizing its file system by combining the right file format, compression, and size. For example, small files are a performance anti-pattern for object storage (50X impact), but using streaming data forces us to create such files.
Cleaning up dirty data
Every data professional knows that ensuring data quality is vital to producing usable query results. Streaming data can be extra challenging in this regard, as it tends to be “dirty,” with new fields that are added without warning and frequent mistakes in the data collection process. In order to bridge the gap to analytics-ready data, developers have to be able to address data quality issues quickly.
The best architecture for that is called “event sourcing.” Implementing this requires a repository of all raw events, a schema on read, and an execution engine that transforms raw events into tables. Every time analytics data needs to be adjusted, the developer will run a processing job from the raw data repository (time travel/replay/reprocessing).
These are just a few of the technical considerations that teams need to grapple with when attempting to use real-time data. They also have to orchestrate huge volumes of jobs to handle rapidly changing data, work to ensure data consistency with exactly-once processing,
and deal with concurrent requests from a variety of users all trying to get insights from the same data at the same time.
How Upsolver solves stream processing
Sisense has partnered with Upsolver to help solve these technical challenges. Upsolver has spent countless hours to eliminate the engineering complexity so that companies can solve these types of data challenges in real time.
Challenges like job orchestration, exactly-once data consistency, and file system management are heavy engineering challenges. Solving them is a complex challenge, resulting in a data engineering bottleneck that slows the analytics process.
Upsolver encapsulates the streaming engineering complexity by empowering every technical user (data engineers, DBAs, analysts, scientists, developers) to ingest, discover, and prepare streaming data for analytics. These experts can define transformations from streams to tables and govern the processing progress using a visual, SQL-based interface. Engineering complexity is abstracted from the user via an execution engine that turns multiple data sources (stream and batch) into tables in various databases.
Let’s dig into how it works!
Step 1: Connect to data sources — cloud storage, data streams, databases
Once inside your Upsolver user interface (UI), users simply click on “Add a new data source” and choose one of the built-in connectors for cloud storage, databases, or streams. The UI allows users to parse their source data in formats including JSON, CSV, Avro, Parquet and Protobuf.
A sample from the parsed data is displayed before ingestion starts:
Schema on read and statistics per field are automatically detected and presented to the user:
Step 2: Define stateful transformations
Now that each data source has been set up, it’s time to define an output to be Upsolver’s entity for processing jobs. Each output creates one table in a target sync and populates it continuously with data based on the transformations the user defined.
Transformations can be defined via the UI, SQL, or both (bidirectional sync between UI and SQL).
(Note: The SQL statement isn’t for querying data like in databases. In Upsolver, SQL is used to define continuous processing jobs so the SQL statement is executed once for every source event.)
Upsolver provides over 200 built-in functions and ANSI SQL support out of the box to simplify stream processing. These native features hide implementation complexities from users so their time isn’t wasted on customized coding.
Upsolver also provides stateful transformations that allow the user to join the currently processed event with other data sources, run aggregation (with and without time windows), and deduplicate similar events.
Adding transformations using the UI:
Editing transformations with SQL:
The output below calculates the order total by aggregating net total and sales tax.
IF_DUPLICATE function for filtering similar events:
Defining keys for tables in Amazon Redshift:
This feature is necessary to support use cases like change data capture log streaming and enforcing GDPR/CCPA compliance in Amazon Redshift.
Step 3: Execute jobs query optimization
Now that your output is ready, click on RUN to execute your job.
First, fill in the login information for Redshift:
Second, choose the Upsolver cluster for running the output and the time frame to run at.
Upsolver clusters run on Amazon EC2 spot instances and scale out automatically based on compute utilization.
Note that during this entire process, the user didn’t need to define anything except data transformations: The processing job is automatically orchestrated, and exactly-once data consistency is guaranteed by the engine. The impact of implementing these best practices is faster queries that will power Redshift and dashboards in Sisense.
Step 4: Query
Log in to your Sisense environment with at least data designer privileges. We’ll be creating a live connection to the Redshift cluster that was set up in Step 3 and a simple dashboard. The first step is to navigate to the “Data” tab and create a new live model. You can give it whatever name you like.
Now, let’s connect to Redshift. Click the “+ Data” button in the upper right hand corner of the screen. Select Redshift from the list of available connections. Next, you’ll need to enter your Redshift location and credentials. The location is the URL/endpoint for the cluster. The credentials used should have read access to the data you are using. Make sure to check “Use SSL” and “Trust Server Certificate” (most clusters require these options by default).
When ready, click “Next” and select the database you’ll be connecting to. Then click “Next” again.
Here you can see all of the tables you have access to, separated by schema. Navigate to the schema/tables you want to import, and select a few to start with. You can always add more later. When ready, click “Done” to add these tables to your model.
Once you have tables to work with, it’s time to connect them. In this case, the tables link on “orderId.” Simply drag one table onto the other and select the linking field (the key) on both sides. When you see the line appear between the tables, you can click “Done.” Finally, click “Publish” in the upper right hand corner, and you’re ready to create a dashboard!
Now you need to navigate over to the “Analytics” tab. This is where we’ll create a new dashboard to view and explore the data. In there, you’ll see a button to create a new dashboard. Click it and select the live model you just published. You can give another name to the dashboard or it will default to the same name as the model.
Now, it’s time to build the dashboard and explore your data. In the following animation, we create a few different visualizations based on fields from both tables. Notice that we are selecting fields to use as both dimensions and measures from both tables in the model. Sisense is automatically joining the tables for us, and Upsolver keeps the data in both tables synced with the stream.
Feel free to explore your data now. You can left-click on elements of the dashboard to place a filter and right-click to drill down. You can also interact with filters on the right hand filter panel. Again, you can filter on fields in both tables, and Sisense will determine the right queries for you.
Getting your streaming data to work for you
Streaming data analytics is important for businesses to make critical decisions in real time. To get there, it’s necessary to solve several engineering complexities that are introduced with streaming and aren’t addressed with the batch technology stack. In this article, we showed how Upsolver, AWS, and Sisense can be used together to deliver an end-to-end solution for streaming data analytics that is quick to set up, easy to operate without coding, and scales elastically using cloud computing/storage.
Ori Rafael has a passion for taking technology and making it useful for people and organizations. Before founding Upsolver, Ori performed a variety of technology management roles at IDF’s elite technology intelligence unit, followed by corporate roles. Ori has a B.A. in Computer Science and an MBA.
Mei Long, PM of Upsolver has held senior positions in many high-profile technology startups. Before Upsolver, she played an instrumental role on the teams that contributed to the Apache Hadoop, Spark, Zeppelin, Kafka, and Kubernetes project. Mei has a B.S. in Computer Engineering.