Processing And Analyzing Extremely Large Amounts Of Data In Python

Abstract

Many scientific applications need to save and read extremely large amounts of data (often derived from experimental devices). Analyzing the data requires re-reading it many times in order to select the subset that best reflects the scenario under study. In general, the gathered data does not need to be modified (except perhaps to enlarge the dataset), only accessed multiple times from multiple points of entry.

The goal of [http://pytables.sourceforge.net PyTables] is to address these requirements by enabling the end user to easily manipulate scientific data tables, numarray objects and Numerical Python objects in a persistent, hierarchical structure.
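
To make this concrete, here is a minimal sketch of the workflow PyTables enables. It uses the modern PyTables API (open_file, create_table and friends); the 0.3 API described in this talk used different names, and the file name and column layout below are illustrative assumptions, not part of the talk.

    import tables  # PyTables

    # Hypothetical row layout, for illustration only.
    class Particle(tables.IsDescription):
        name = tables.StringCol(16)      # 16-byte string column
        energy = tables.Float64Col()     # double-precision column

    with tables.open_file("experiment.h5", mode="w") as h5:
        # Create a table under the root group of the HDF5 file.
        table = h5.create_table("/", "particles", Particle, "Detector readout")
        row = table.row
        for i in range(1000000):
            row["name"] = b"p%d" % i
            row["energy"] = 0.5 * i
            row.append()                 # rows are buffered, flushed in batches
        table.flush()

    # Tables are appendable: reopen the file later (even in another Python
    # session) and keep adding records, without copying the dataset or
    # redefining its structure.
    with tables.open_file("experiment.h5", mode="a") as h5:
        h5.root.particles.append([(b"p-new", 42.0)])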

Capabilities

During my talk, I'll be describing the capabilities of the forthcoming PyTables 0.3 version, which include:

  • Appendable tables: Records can be added to already created tables (as in the sketch above), without copying the dataset or redefining its structure, even between different Python sessions.
  • Unlimited data size: Allows working with tables whose number of records is too large to fit in memory.
  • Support of Numeric and numarray Python arrays: Numeric arrays are a very useful complement to tables for keeping homogeneous table slices (such as selections of table columns). A table column can also be defined as a one-dimensional array (an n-dimensional generalization will come in the future).
  • Hierarchical data model: PyTables builds up an object tree in memory that replicates the hierarchical structure existing on disk. Objects on disk are accessed by walking through this tree and manipulating its nodes, an approach that has proven very effective when working with complex data trees (illustrated in the sketch after this list).
  • Data compression: Data compression (through the zlib library) is supported out of the box. This becomes important when you have repetitive data patterns and no time to search for an optimized way to store them.
  • Support of files bigger than 2 GB: HDF5 already supports this, provided your platform supports the C long long integer (or __int64 on Windows).
  • Ability to read generic HDF5 files and work natively with them: You can create your HDF5 files in C or Fortran, open them with PyTables, and then perform any operation on these HDF5 objects that PyTables allows.
  • Architecture-independent: PyTables has been carefully coded (as has HDF5 itself) with little-endian/big-endian byte-ordering issues in mind. In principle, you can write a file on a big-endian machine (such as a SPARC or MIPS) and read it on a little-endian one (such as Intel or Alpha) without problems.
  • Optimized I/O: PyTables has been designed from the ground up with performance in mind. In its newest incarnation, it can read and write tables and arrays from/to disk at a speed generally bounded only by the disk I/O speed, thanks to a smart combination of buffered I/O, Pyrex extensions, the HDF5 and numarray libraries and, last but not least, Psyco. We will see some benchmarks comparing PyTables' speed with that of other databases.
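
The hierarchical model, out-of-the-box compression, and HDF5 interoperability can be sketched together. Again this uses the modern PyTables API (create_group, Filters, walk_nodes) rather than the 0.3 one, and the group name, array shape, and compression level are made-up example values.

    import numpy as np
    import tables

    with tables.open_file("experiment.h5", mode="a") as h5:
        # Groups and leaves form an object tree mirroring the on-disk layout.
        grp = h5.create_group("/", "detector", "Raw detector frames")

        # zlib compression out of the box, enabled per leaf via Filters.
        zlib = tables.Filters(complib="zlib", complevel=5)
        frame = h5.create_carray(grp, "frame0", atom=tables.Float64Atom(),
                                 shape=(512, 512), filters=zlib)
        frame[:] = np.zeros((512, 512))  # repetitive patterns compress very well

        # Access objects by walking the tree. Because the file is plain HDF5,
        # the same walk works on files created from C or Fortran.
        for leaf in h5.walk_nodes("/", classname="Leaf"):
            print(leaf._v_pathname, leaf.shape)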
