1
6
submitted 1 week ago by [email protected] to c/[email protected]

7/3/2024

Steven Wang writes:

Many in the data space are now aware of Iceberg and its powerful features that bring database-like functionality to files stored in the likes of S3 or GCS. But Iceberg is just one piece of the puzzle when it comes to transforming files in a data lake into a Lakehouse capable of analytical and ML workloads. Along with Iceberg, which is primarily a table format, a query engine is also required to run queries against the tables and schemas managed by Iceberg. In this post we explore some of the query engines available to those looking to build a data stack around Iceberg: Snowflake, Spark, Trino, and DuckDB.

...

DuckDB + Iceberg Example

We will be loading 12 months of NYC yellow cab trip data (April 2023 - April 2024) into Iceberg tables and demonstrating how DuckDB can query these tables.
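The article walks through the full setup; purely as a hedged illustration of the querying side, here is roughly what pointing DuckDB's iceberg extension at such a table can look like (the bucket path and aggregation are placeholders of mine, not the article's code):

```python
# Minimal sketch: querying an Iceberg table from DuckDB via the iceberg extension.
# The S3 path is a placeholder; reading from S3 also assumes the httpfs extension
# and credentials are configured.
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")

# iceberg_scan reads the table metadata and exposes the table like any relation
trips = con.execute("""
    SELECT passenger_count, count(*) AS rides, avg(total_amount) AS avg_fare
    FROM iceberg_scan('s3://my-bucket/nyc_taxi/yellow_trips')
    GROUP BY passenger_count
    ORDER BY rides DESC
""").fetchdf()
print(trips)
```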

Read Comparing Iceberg Query Engines

2
6
submitted 1 week ago by [email protected] to c/[email protected]

Let me share my post with a detailed step-by-step guide on how an existing Spark Scala library can be adapted to work with the recently introduced Spark Connect. As an example I chose a popular open source data quality tool, AWS Deequ. I made all the necessary protobuf messages and a Spark Connect plugin. I tested it from PySpark Connect 3.5.1 and it works. Of course, all the code is public in git.
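For context, here is a hedged sketch (not from the post) of what the client side looks like once such a server-side plugin is in place; the server URL is a placeholder, and registering the Deequ plugin on the Spark Connect server is the part the guide itself covers:

```python
# Connect from PySpark 3.5 to a Spark Connect server. The Deequ plugin would be
# registered on the server side via Spark Connect's extension configs; the client
# just sends plans over the wire as usual.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, None)], ["id", "name"])

# Regular DataFrame operations are serialized as protobuf plans and executed on
# the server - the same mechanism a Connect plugin (like the Deequ one) extends.
df.filter("name IS NOT NULL").show()
```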

3
11
submitted 4 weeks ago by [email protected] to c/[email protected]

Time and again I see the same questions asked: "Why should I use dbt?" or "I don't understand what value dbt offers". So I thought I'd put together an article that touches on some of the benefits, as well as a step-by-step walkthrough of setting up a new project (using DuckDB as the database), complete with an associated GitHub repo for you to take a look at.

Having used dbt since early 2018, and with my partner being a dbt trainer, I hope this article is useful for some of you. The link bypasses the paywall.
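Not from the article, but as a small aside for anyone who prefers driving dbt from Python rather than the CLI, dbt-core 1.5+ exposes a programmatic runner (the project/profile locations below are placeholders):

```python
# Hedged sketch: invoking a dbt + DuckDB project programmatically.
# Most people will simply run `dbt run` from the CLI; this is the same thing in Python.
from dbt.cli.main import dbtRunner

res = dbtRunner().invoke(["run", "--project-dir", ".", "--profiles-dir", "."])
if not res.success:
    raise SystemExit("dbt run failed")
```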

4
1
submitted 2 months ago by [email protected] to c/[email protected]
5
20
submitted 2 months ago by [email protected] to c/[email protected]

If you're a Data Engineer, before long you'll be asked to build a real-time pipeline.

In my latest article, I build a real-time pipeline using Kafka, Polars and Delta tables to demonstrate how these can work together. Everything is available to try yourself in the associated GitHub repo. So if you're curious, take a moment to check out this technical post.
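As a rough, hedged sketch of the idea (not the article's code), the core loop is: poll a batch of messages from Kafka, shape them with Polars, and append them to a Delta table. Topic name, broker address, and schema below are placeholders:

```python
# Kafka -> Polars -> Delta, in miniature. Assumes kafka-python, polars, and the
# deltalake package are installed, and that messages are JSON objects.
import json
import polars as pl
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop polling when the topic goes quiet
)

batch = [msg.value for msg in consumer]
if batch:
    df = pl.DataFrame(batch)
    df.write_delta("./delta/events", mode="append")  # transactional append
```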

6
19
Diagrams as Code (medium.com)
submitted 2 months ago by [email protected] to c/[email protected]

How often do you build and edit Entity Relationship Diagrams? If the answer is ‘more often than I’d like’, and you’re fed up with tweaking your diagrams, take <5 minutes to read my latest article on building your diagrams with code. Track their changes in GitHub, have them build as part of your CI/CD pipeline, and even drop them into your dbt docs if you like.

This is a ‘friends and family’ link, so it’ll bypass the usual Medium paywall.

I’m not affiliated with the tool I’ve chosen in any way. I just like how it works.

Let me know your thoughts!

7
0
submitted 2 months ago by [email protected] to c/[email protected]
8
5
submitted 4 months ago by [email protected] to c/[email protected]
9
7
submitted 4 months ago by [email protected] to c/[email protected]

Mar 8, 2024 | Hakampreet Singh Pandher writes:

Yelp relies heavily on streaming to synchronize enormous volumes of data in real time. This is facilitated by Yelp’s underlying data pipeline infrastructure, which manages the real-time flow of millions of messages originating from a plethora of services. This blog post covers how we leverage Yelp’s extensive streaming infrastructure to build robust data abstractions for our offline and streaming data consumers. We will use Yelp’s Business Properties ecosystem (explained in the upcoming sections) as an example.

Read Building data abstractions with streaming at Yelp

10
7
submitted 4 months ago by [email protected] to c/[email protected]

I’ve written a series of Medium articles on creating a data pipeline from scratch, using Polars and Delta tables. The first (linked) is an overview, with links to the GitHub repository and each of the deeper-dive articles. I then go into the next level of detail, walking through each component.

The articles are paywalled (it took time to build and document), but the link provided is the ‘family & friends’ link which bypasses the paywall for the Lemmy community.

I hope some of you may find this helpful.
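Not from the series itself, but as a taste of the reading side of a Polars + Delta setup, a minimal sketch (the table path and column name are placeholders):

```python
# Lazily scan a Delta table with Polars and roll it up.
# Assumes a Delta table already exists at ./delta/events with an event_date column.
import polars as pl

daily = (
    pl.scan_delta("./delta/events")
      .group_by("event_date")
      .agg(pl.len().alias("events"))
      .collect()
)
print(daily)
```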

11
17
submitted 5 months ago by [email protected] to c/[email protected]

Hello,

I am looking for some advice to help me out at my job. Apologies if this is the wrong place to ask.

So, basically my boss is a complete technophobe, and all of our data is stored across multiple Excel files in Dropbox. I'm looking for a way to change that into a centralized database. I know my way around a computer, but writing code is not something I have ever been able to grasp well.

The main issue with our situation is that our workers are all completely remote, and no, I don't mean working from home in the suburbs from a home office. They use little laptops with no data connection and go out gathering data every day from a variety of locations, sometimes without even cell coverage.

We need up to 20 people entering data all day long and then updating a centralized database at the end of the day when they get back home and have internet connection. It will generally all be new entries, no one will need to be updating old entries.

It would be nice to have some sort of data entry form in Dropbox and a centralized database on our local server at head office, which pulls the data in at the end of each day. Field workers would also need access to certain data such as addresses, contact info, maps, photos, historical data, etc., but not all of it. For example, a worker in City A only needs access to the historical data from records in and around City A, and workers in City B only need access to records involving City B.

Are there any recommended software options which can achieve this? It needs to be relatively user friendly and simple, as our workers are typically biology-oriented summer students, not programmers.

12
8
submitted 5 months ago by [email protected] to c/[email protected]

Apple has donated to the community their own implementation of native physical execution of Apache Spark plans, built on DataFusion.

13
5
submitted 5 months ago by [email protected] to c/[email protected]

A few years ago, if you'd mentioned Infrastructure-as-Code (IaC) to me, I would've given you a puzzled look. However I'm now on the bandwagon. And to help others understand how it can benefit them, I've pulled together a simple GitHub repo that showcases how Terraform can be used with Snowflake to manage users, roles, warehouses and databases.

The readme hopefully gives anyone who wants to give it a go the ability to step through and see results. I'm sharing this in the hopes that it is useful to some of you.

14
6
submitted 5 months ago by [email protected] to c/[email protected]

December 28 2023 Pankaj Singh writes:

In big data processing and analytics, choosing the right tool is paramount for efficiently extracting meaningful insights from vast datasets. Two popular frameworks that have gained significant traction in the industry are Apache Spark and Presto. Both are designed to handle large-scale data processing efficiently, yet they have distinct features and use cases. As organizations grapple with the complexities of handling massive volumes of data, a comprehensive understanding of Spark and Presto’s nuances and distinctive features becomes essential. In this article, we will compare Spark vs Presto, exploring their performance and scalability, data processing capabilities, ecosystem, integration, and use cases and applications.

Read Spark vs Presto: A Comprehensive Comparison

15
14
submitted 5 months ago by [email protected] to c/[email protected]

Hi all,

For those wanting a quick repo to use as a basis to get started, I’ve created jen-ai.

There are full instructions in the readme. Once running you can talk to it, and it will respond.

It’s basic, but a place to start.

16
14
submitted 5 months ago by [email protected] to c/[email protected]
17
10
submitted 5 months ago* (last edited 5 months ago) by [email protected] to c/[email protected]

Since there's only one mod here and they are also an Admin, I'd like to volunteer to moderate this community.

18
5
submitted 6 months ago by [email protected] to c/[email protected]

Karl W. Broman & Kara H. Woo write:

Spreadsheets are widely used software tools for data entry, storage, analysis, and visualization. Focusing on the data entry and storage aspects, this article offers practical recommendations for organizing spreadsheet data to reduce errors and ease later analyses. The basic principles are: be consistent, write dates like YYYY-MM-DD, do not leave any cells empty, put just one thing in a cell, organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row), create a data dictionary, do not include calculations in the raw data files, do not use font color or highlighting as data, choose good names for things, make backups, use data validation to avoid data entry errors, and save the data in plain text files.

Read Data Organization in Spreadsheets

This article is weird in that it appears to be written for an audience that would find its contents irrelevant, but it has great information for people that are trying to reduce or eliminate their use of spreadsheets.
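As a tiny illustration of a few of those principles (ISO dates, one value per cell, a single rectangle with one header row, plain-text output), a sketch of mine rather than the paper's:

```python
# Toy example of "tidy" spreadsheet-style data kept as plain text.
import pandas as pd

df = pd.DataFrame(
    {
        "subject_id": ["S01", "S02"],
        "visit_date": ["2024-01-15", "2024-02-03"],  # dates as YYYY-MM-DD
        "weight_kg": [71.2, 68.4],                   # units in the column name
    }
)
df.to_csv("measurements.csv", index=False)           # plain text, no formatting
```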

19
4
submitted 6 months ago by [email protected] to c/[email protected]

cross-posted from: https://programming.dev/post/8246313

Data science managers and leaders should make sure that cooperative work on models is facilitated and streamlined. In this post, our very own Shachaf Poran, PhD suggests one method of doing so.

20
8
Database Fundamentals (tontinton.com)
submitted 7 months ago by [email protected] to c/[email protected]

Posted on 2023-12-15 by Tony Solomonik

About a year ago, I tried thinking about which database I should choose for my next project, and came to the realization that I don't really know the differences between databases well enough. I went to different database websites and saw mostly marketing and words I don't understand.

This is when I decided to read the excellent books Database Internals by Alex Petrov and Designing Data-Intensive Applications by Martin Kleppmann.

The books piqued my curiosity enough to write my own little database I called dbeel.

This post is basically a short summary of these books, with a focus on the fundamental problems a database engineer thinks about in the shower.

Read Database Fundamentals

21
8
submitted 7 months ago by [email protected] to c/[email protected]

Analytics Data Storage

Over the past 40 years, businesses have had a common problem: trying to create analytics information from raw application data doesn't work well. The format isn't ideal for analytics tools, analytics workloads can cause performance spikes in critical applications, and the data for a single report can come from many different sources. AI has made the issue worse, since it needs yet another kind of data formatting and access. The primary solution has been to copy the data into a separate storage solution better suited to analytics and AI needs.

Data Warehouses

Data warehouses are large, centralized repositories for storing, managing, and analyzing vast amounts of structured data from various sources. They are designed to support efficient querying and reporting, providing businesses with crucial insights for informed decision-making. Data warehouses utilize a schema-based design, often employing star or snowflake schemas, which organize data in tables related by keys. They are usually built using SQL engines. Small data warehouses can easily be created on the same software as applications, but specialized SQL engines like Amazon Redshift, Google BigQuery, and Snowflake are designed to handle analytics data on a much larger scale.

The main goal of a data warehouse is to support Online Analytical Processing (OLAP), allowing users to perform multidimensional data analysis, exploring the data from different perspectives and dimensions. The primary issue with this format is that it isn't well suited for machine learning applications, because the data access options don't work with machine learning tools at scale.

Star Schemas

Star and snowflake schemas are similar: a snowflake schema is essentially a star schema that allows for additional complexity. The names come from the fact that diagrams of the various tables and their connections look like stars and snowflakes.

Data is divided into two types of tables: fact tables and dimensions. Dimension tables contain all the data common across events. For a specific sale at a retailer, there might be tables for customer information, the sale date, the order status, and the products sold. Those would be linked to a single fact table containing data specific to that sale, like the purchase and retail prices.
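A toy sketch of that layout (table and column names are illustrative, not from any particular warehouse), here using DuckDB from Python:

```python
# Minimal star schema: one fact table keyed to two dimension tables,
# plus a typical roll-up query that joins them.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE dim_customer (customer_id INT, city TEXT)")
con.execute("CREATE TABLE dim_product (product_id INT, category TEXT)")
con.execute("""
    CREATE TABLE fact_sales (
        sale_id INT, customer_id INT, product_id INT,
        sale_date DATE, retail_price DECIMAL(10,2), purchase_price DECIMAL(10,2)
    )
""")

# Analytical queries join the fact table out to whichever dimensions they need
print(con.execute("""
    SELECT p.category, c.city,
           sum(f.retail_price - f.purchase_price) AS margin
    FROM fact_sales f
    JOIN dim_customer c USING (customer_id)
    JOIN dim_product  p USING (product_id)
    GROUP BY p.category, c.city
""").fetchdf())
```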

Data Lakes

Data warehouses had three main failings that required the creation of data lakes.

  1. The data isn't easily accessible for ML tools.
  2. Data Warehouses don't handle unstructured data like images and audio very well.
  3. Data storage costs are much higher than in what would become data lakes.

Data lakes are usually built on top of an implementation of HDFS, the Hadoop Distributed File System. Previous file storage options had limits on how much data they could store, and large companies were starting to exceed those limits. HDFS effectively removes those limits, since existing technology can handle data storage at a significantly larger scale than any current data lake requires, and this can be expanded further in the future.

Data warehouses are considered "schema on write": the data is organized into tables before it is written to storage. Data lakes are stored as files and are primarily "schema on read": the data may only be partially structured in the loading process, and transformations are applied when a system reads it.
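A small sketch of "schema on read" (paths and fields are placeholders): the raw JSON files in the lake carry no enforced structure, and a schema is only applied when a job reads them.

```python
# Apply a schema at read time with PySpark; nothing was enforced at write time,
# unlike a warehouse table, and the same files could be re-read with a different schema.
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, TimestampType,
)

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = spark.read.schema(schema).json("s3://my-lake/raw/events/")
events.show()
```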

Data lakes are often structured in layers. The first layer is the data in its original, raw format. The following layers will have increasing levels of structure. For example, an image might be in its raw form in the first layer, but later layers may have information about the image instead. The new file might have the image's metadata, a link to the original image, and an AI-generated description of what it contains.

Data lakes solved the scale problem and are more useful for machine learning, but they don't have the same quality controls over the data that warehouses do, and "schema on read" creates significant performance problems. For the past decade, companies have maintained a complex ecosystem of data lakes, warehouses, and real-time streaming systems like Kafka. This supports analytics and ML, but creating, maintaining, and processing the data is exceptionally time-consuming with so many systems involved.

Data Lakehouse

The data lakehouse is an emerging style of data storage that attempts to combine the benefits of both data lakes and data warehouses. The foundation of a lakehouse is a data lake, so it can support the exabytes of data that data lakes currently contain.

The first innovation in this direction is technologies like Delta Lake, which combines Apache Spark for data processing with the Parquet file format to create data lake layers that support transactions and data quality controls while maintaining a compact data format. This is ideal for ML use cases, since it solves the data quality problems of data lakes and the data access problems of data warehouses. Current work focuses on allowing a lakehouse to provide more effective analytics with tools like caching and indexes.
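A hedged sketch of those transactional guarantees, using the standalone deltalake Python package rather than Spark (which is how Delta Lake is more commonly run, as described above); the path and columns are placeholders:

```python
# Transactional appends to a Delta table: readers see either the old or the new
# version, never a half-written set of Parquet files.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

batch = pd.DataFrame({"order_id": [1, 2], "amount": [19.99, 5.50]})
write_deltalake("./lake/orders", batch, mode="append")

dt = DeltaTable("./lake/orders")
print(dt.version())     # each commit bumps the table version
print(dt.to_pandas())   # the data itself still lives in plain Parquet underneath
```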

22
9
submitted 7 months ago by [email protected] to c/[email protected]
23
10
submitted 8 months ago by [email protected] to c/[email protected]
24
-4
submitted 9 months ago by [email protected] to c/[email protected]
25
-4
submitted 9 months ago by [email protected] to c/[email protected]

Data Engineering

355 readers
1 users here now

A community for discussion about data engineering

Icon base by Delapouite under CC BY 3.0 with modifications to add a gradient

founded 1 year ago