Has anyone here used Dremio?

Our data science organization is a bit behind in tech, to say the least. We keep all our data in flat files on various servers and don’t have access to any relational databases. Our CEO has stated that he doesn’t believe in outsourcing our data storage to the cloud. Now the tech team is excited to introduce the cutting-edge Dremio service. From some quick searching, it appears to be geared more toward cloud data lakes. However, if it lets me query our non-cloud relational data with the low latency the sales pitch promises, I’ll be happy.

Currently, our options are either to read the entire dataset into memory and process it in Python or R, or to set up an ETL process to move the data from flat files into HDFS. With the latter, however, latency means any individual query takes a minimum of 30 seconds. Our data is relatively large but not really “big data”; the main dataset we use is around 50GB.
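For concreteness, the “read it all into memory” option looks roughly like this (the path and column names are made up):

```python
import pandas as pd

# Load the entire ~50GB flat file into memory before any analysis can start.
# In practice this needs RAM well above the on-disk size of the file.
df = pd.read_csv("/mnt/shared/main_dataset.csv")

# Even a simple aggregation pays the full load cost every time.
summary = df.groupby("region")["revenue"].sum()
print(summary)
```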

Does anyone have any experience with their company adopting Dremio and whether it actually lives up to the sales pitch?

3 Likes

Dremio’s introduction might speed up your query workflows and provide low-latency options for relational data stored outside the cloud. Although its primary application is cloud data lakes, its features may still be useful for your company’s on-premises setup. To judge whether Dremio is effective and appropriate for your data science requirements, seek out feedback from companies that have actually implemented it.

5 Likes

Most of the pain of using query engines over object storage is in the ongoing management of the files themselves: partitioning, compression, and compacting many small files into fewer large ones (a sketch of that compaction step follows below).
Cloud data lakes are tremendously valuable for exploratory and ad hoc data analysis. If you really require sub-second queries on structured data, you’re better off with a data warehouse.
I’m not totally clear on your use case; you mentioned Python/R, but understanding your end-to-end use case(s) would let us recommend more precise solutions.
You might consider a tool like Upsolver, which helps with data management over the lake. Note: I work there, so I’m a tad biased.
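Here’s a rough sketch of the kind of compaction job you end up owning, using pyarrow (paths and the target file size are illustrative):

```python
import pyarrow.dataset as ds

# Read a directory of many small Parquet files as one logical dataset.
small_files = ds.dataset("/data/events/raw", format="parquet")

# Rewrite it as fewer, larger files so query engines scan less metadata.
ds.write_dataset(
    small_files,
    "/data/events/compacted",
    format="parquet",
    max_rows_per_file=5_000_000,  # illustrative target size
)
```

Jobs like this have to keep running as new small files land, which is exactly the ongoing maintenance burden I mean.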

2 Likes

Dremio is a data lakehouse platform designed to help users analyze and process large volumes of data efficiently. It integrates with various data sources and provides features for data virtualization, acceleration, and security.

1 Like

I quickly reviewed Dremio, which is described as a “data lake engine that creates a semantic layer.” To simplify, the tool essentially creates a structured view of your source files in the data lake, making them more accessible in a tabular format. However, this explanation oversimplifies its capabilities.

Key concerns:

I’m skeptical about virtualization without persisting significant data. Since Dremio queries source files directly or intermediate HDFS/Parquet files it generates, the end-user experience could be sluggish due to the absence of robust ETL processes. I find this suitable for proof of concepts or temporary solutions, but not for production environments.

Dremio appears heavily focused on SQL and business intelligence (BI). Does this align with your goals? Your post mentions Python/R but lacks SQL discussion.

Building what Dremio offers seems feasible in a cloud environment without extra licensing costs. Azure Synapse Analytics, for example, supports accessing external files from data lakes and building external tables, which gives you similar functionality. On-premises, Hive can achieve this with adequate hardware (a sketch of that approach follows at the end of this post).

Consider the implications for your CEO’s perception. If Dremio is seen as a risky investment, introducing it could hurt the case for future cloud adoption. It’s worth strategizing on how to convince your CEO to invest in a proper cloud environment, since data science typically needs more than flat files to thrive.
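As a rough illustration of the Hive route mentioned above, here’s a minimal sketch assuming a reachable HiveServer2 instance and Parquet files already sitting in HDFS (the host, table, and schema are all hypothetical):

```python
from pyhive import hive  # assumes PyHive is installed and HiveServer2 is reachable

conn = hive.Connection(host="hive.internal", port=10000)
cursor = conn.cursor()

# Expose existing Parquet files in HDFS as a queryable external table;
# no data is copied, Hive just layers a schema over the files.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales (
        order_id BIGINT,
        region   STRING,
        revenue  DOUBLE
    )
    STORED AS PARQUET
    LOCATION 'hdfs:///data/sales'
""")

cursor.execute("SELECT region, SUM(revenue) FROM sales GROUP BY region")
print(cursor.fetchall())
```

The external table is just a schema over the existing files, so nothing gets copied, which is essentially the same virtualization trick Dremio performs.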

1 Like

I tried using Dremio, but the setup was too much work, so I gave up. One of its selling points is the ability to build a data lake over a heterogeneous environment. I’d seriously consider it at a new startup, but retrofitting it into an established business has proven difficult, at least for me.

1 Like

If you’re using Kubernetes, you can deploy it using their Helm chart, and you can roll back and recover releases with it.

1 Like

I never managed to get my Dremio deployment up and running at my previous company, so that’s one apparent disadvantage.

1 Like

Dremio is a data lakehouse platform that enables fast analytics and data processing on cloud data lakes. It’s designed so that users can query and explore data in the lake directly, without requiring ETL processes.
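I haven’t run this against a production cluster myself, but for reference, querying Dremio from Python typically goes through its Arrow Flight endpoint; here’s a rough sketch, with the host, credentials, and table as placeholders:

```python
from pyarrow import flight

# Connect to Dremio's Arrow Flight endpoint (32010 is Dremio's default Flight port).
client = flight.FlightClient("grpc+tcp://dremio.internal:32010")

# Basic auth returns a bearer-token header to attach to subsequent calls.
bearer_token = client.authenticate_basic_token("user", "password")
options = flight.FlightCallOptions(headers=[bearer_token])

# Ask Dremio to plan the query, then stream the results as an Arrow table.
descriptor = flight.FlightDescriptor.for_command("SELECT * FROM mysource.sales LIMIT 10")
info = client.get_flight_info(descriptor, options)
reader = client.do_get(info.endpoints[0].ticket, options)
table = reader.read_all()
print(table.to_pandas())
```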

Have you personally worked with Dremio? Or could you share any feedback or insights from companies that have implemented it, especially those with on-premises configurations?