Dataframe DBs and kdb+

1.
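These days I am reimplementing a program written in Python in C. Working out the logic and porting it to C is not hard for a C developer. The trouble came from the program's use of the Pandas library. Because the data was managed in Pandas DataFrames, deciding how to lay out the equivalent data structures in C was a headache. To be honest, I am not comfortable with DataFrames. So, partly as a study exercise, I went looking for material on dataframes.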

Pandas DataFrame is two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. Pandas DataFrame consists of three principal components, the data, rows, and columns.

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.

A data frame is a table-like data structure available in languages like R and Python. Statisticians, scientists, and programmers use them in data analysis code.
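To make the definitions above concrete, here is a minimal sketch in plain Python (standard library only, no pandas) of the layout they describe: row labels plus one homogeneous array per labeled column. The class name `ColumnTable` and its methods are hypothetical names chosen for this illustration, and this is not how pandas is implemented internally. It only shows the "columns of potentially different types" idea, which maps naturally onto a struct-of-arrays design when porting to C.

```python
from array import array


class ColumnTable:
    """Toy column-oriented table: one homogeneous array per labeled column."""

    def __init__(self, index):
        self.index = list(index)   # row labels
        self.columns = {}          # column label -> column data

    def add_column(self, name, kind, values):
        # kind: "q" (64-bit int) or "d" (double) use a typed array;
        # "str" is this sketch's own marker for a list of strings.
        values = list(values)
        if len(values) != len(self.index):
            raise ValueError("column length must match number of rows")
        if kind == "str":
            self.columns[name] = [str(v) for v in values]
        else:
            self.columns[name] = array(kind, values)

    def column(self, name):
        # Column access is direct: the whole column is one array.
        return self.columns[name]

    def row(self, label):
        # Row access gathers one element from every column.
        i = self.index.index(label)
        return {name: col[i] for name, col in self.columns.items()}


t = ColumnTable(index=["a", "b", "c"])
t.add_column("qty", "q", [10, 20, 30])
t.add_column("price", "d", [1.5, 2.25, 3.0])
t.add_column("symbol", "str", ["IBM", "MSFT", "AAPL"])
print(t.row("b"))  # {'qty': 20, 'price': 2.25, 'symbol': 'MSFT'}
```

In C, the same idea would become an array of column descriptors, each holding a type tag and a pointer to a contiguous buffer, with the row labels kept in a separate array.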

The paper posted in What Is a Data Frame? (In Python, R, and SQL) takes a somewhat more theoretical approach.


The reason I bring up dataframes is TileDB, a database that supports pandas dataframes and also provides a C API.

What is TileDB? TileDB is a powerful engine architected around multi-dimensional arrays that enables storing and accessing:
Dense arrays (e.g., satellite images)
Sparse arrays (e.g., LiDAR, genomics)
Dataframes (any data in tabular form)
Key-values (mappings between keys and values)

You can use TileDB to store data in a variety of applications, such as Genomics, Geospatial, Finance and more. The power of TileDB stems from the fact that any data can be modeled efficiently as either a dense or a sparse multi-dimensional array (even a dataframe and a key-value store), which is the format used internally by most data science tooling. By storing your data and metadata in TileDB arrays, you abstract all the data storage and management pains, while efficiently accessing the data with your favorite data science tool.

TileDB has the following features:
Cloud storage (AWS S3, Google Cloud Storage, Azure Blob Storage)
Tiling (i.e., chunking) for fast slicing
Multiple compression, encryption and checksum filters
Fully multi-threaded implementation
Parallel IO
Data versioning (rapid updates, time traveling)
Array metadata
Array groups
Embeddable C++ library
Numerous integrations (Spark, Dask, MariaDB, GDAL, etc.)
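The "tiling for fast slicing" item can be illustrated with a small sketch. The idea is that a dense 2-D array is stored as fixed-size tiles, so a slice only needs to read the tiles it overlaps. The function name `tiles_for_slice` is hypothetical, and this is a toy illustration of chunking in general, not TileDB's API or its on-disk format.

```python
def tiles_for_slice(row_range, col_range, tile_shape):
    """Return (tile_row, tile_col) coordinates of the tiles that overlap
    a half-open slice [r0, r1) x [c0, c1) of a dense 2-D array."""
    (r0, r1), (c0, c1) = row_range, col_range
    tile_h, tile_w = tile_shape
    if r0 >= r1 or c0 >= c1:
        return []  # empty slice touches no tiles
    return [
        (tr, tc)
        for tr in range(r0 // tile_h, (r1 - 1) // tile_h + 1)
        for tc in range(c0 // tile_w, (c1 - 1) // tile_w + 1)
    ]


# A 1000 x 1000 array stored as 100 x 100 tiles has 100 tiles in total.
# Slicing rows 150..259 and columns 0..49 touches only two of them.
print(tiles_for_slice((150, 260), (0, 50), (100, 100)))  # [(1, 0), (2, 0)]
```

The tile size is a trade-off: smaller tiles read less unneeded data per slice, while larger tiles mean fewer separate reads and usually better compression.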

The source is available on GitHub, and the documentation is thorough as well. I found the C API reference under API Reference.

TileDB 2.0 and the Future of Data Science
Using the TileDB C Library

2.
Shakti is a product that is hard to place in any one category. It is a database, but it offers a wide range of other features.

Shakti merges database, language, connectivity and stream processing into one powerful platform

PLATFORM CAPABILITIES
Unified architecture
time-series database
streaming analytics
real-time database
historical data warehouse
relational data
Enhanced time-series support
Tens of millions of messages per second
Billions of records/terabytes of in-memory data
Trillions of records/petabytes of on-disk data
No dependencies
Structured, semi-structured, and unstructured data
Multiple Formats (CSV, JSON etc.)
NoSQL and NewSQL features

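It appears to share many characteristics with Kdb, which large investment banks in the US and Europe have widely adopted. I read it this way because of the founder: Arthur Whitney is the computer scientist who created kdb+.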

Released last month, Whitney’s new creation is called Shakti (after the name for primordial cosmic energy in Hinduism). According to Whitney, Shakti boasts a level of sophistication that will make it, “a new standard for data storage and analysis within financial services and beyond.”

I was curious, so I downloaded it and tried to install it, but it would not run. As of September 2021 there appear to have been no updates. The website is a mess, too.
