Dataphor SQL RAC (Relational Application Companion)


A site of hope for those looking for a true relational database system

Sunday, September 10, 2006

Dataphor - Are You Constrained?

Nowhere can the essence of a business problem or model be better
shown than thru the means of constraints. Constraints encapsulate
the logic of a model and govern the ebb and flow of an application.
Constraints in Dataphor, based on the relational model, offer these
benefits:

    * Emphasize data independence
    * The ability to reference global objects
    * The use of the declarative method in application development

The relational model allows and fosters the clear separation of
how a problem is modeled versus physical implementation details.
This is the relational concept of data independence. The solution
of a business problem should not be encumbered by anything that
detracts from the ability to clearly and precisely model it. This
is the essence of the concept of the logical model in relational
theory. Yet data independence is sorely lacking in sql, where the
line of distinction between the logical model and performance
considerations is too often blurred. The most blatant example
is the conceptual entanglement of keys and indexes. A key, which
is a constraint, is a logical consideration. An index is a physical
consideration. By allowing indexes, based on performance considerations,
to directly enter the logic of the model, the clarity of expression
that is needed is undermined or even lost. The price for a gain
in performance is paid in the ambiguity introduced into the
model. Constraints in Dataphor can be clearly expressed independent
of performance-based concepts, i.e. indexes.
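
To make the distinction concrete, here is a hedged T-SQL illustration
(the EMP table and index names are made up for this example): the key
is a logical statement about the data, while the index is a separate,
purely physical choice, even though sql typically fuses the two.

-- A sketch only: the constraint states a fact about the data (a key);
-- the index exists purely for performance and says nothing logical.
create table EMP
(
    EMPLOYEEID int not null,
    LOCATIONID varchar(10) not null,
    constraint PK_EMP primary key (EMPLOYEEID)   -- logical: a key
)

create nonclustered index IX_EMP_LOCATIONID      -- physical: performance only
    on EMP (LOCATIONID)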

In order to define constraints that need to reference global objects
like tables, sql has to rely on mechanisms like triggers. All
too often this leads to problems with declared referential integrity
constraints and adds an additional layer of complexity to the
problem. Unlike simple column constraints or indexes, sql does
not allow a table constraint to be simply and clearly stated in a
'create table' or 'alter table' statement. Dataphor allows this
simple but powerfully expressive capability.

In sql the idea of metadata is independent of the logical constructs
used to derive data. In other words, in sql you can declare an
index but you can't declare any additional information about the
index. Dataphor allows metadata, information about the object, to
be declared along with the object. This additional information is
used in a declarative way to make application development easier.
Dataphor's derivation engine thus absolves the developer of an
entire level of work. For example, by declaring metadata about
a constraint, the meaning of the constraint can be clarified
while at the same time saving the developer the work of doing this.

An interesting example of the confluence of constraints
(the logical model) and performance can be found in the article

'Indexes on Computed Columns: Speed Up Queries, Add Business Rules'

which can be viewed here. Very clever solutions to constraints
are proposed in the form of indexes on computed columns. Compare
this approach to simply, clearly and explicitly defining a
constraint with metadata in Dataphor, leaving performance optimization
as the last step of the application.

For example, one problem the article discusses is ensuring that VPN
tokens are unique in the table while allowing any number of null VPN
token numbers.
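
For comparison, here is a rough sketch of the kind of computed-column
trick the article describes (the table, column and index names below
are made up for illustration, not taken from the article): non-null
tokens must be unique, while null tokens fall back to a per-row value
derived from the key and so never collide in the unique index.

-- A sketch only, not the article's exact code.
create table VPN_DEMO
(
    EMPLOYEEID int not null primary key,
    VPNTOKEN   int null
)

-- Computed column: the token itself when present, otherwise a key-derived stand-in.
alter table VPN_DEMO add VPNTOKEN_UQ as
    case when VPNTOKEN is not null then cast(VPNTOKEN as varchar(12))
         else 'EMP:' + cast(EMPLOYEEID as varchar(12)) end

create unique index UIX_VPN_DEMO_TOKEN on VPN_DEMO (VPNTOKEN_UQ)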

I'll be using dataphor with MS sql server 2005 sp1 as the data
repository (backend).

Here is the table definition in Dataphor:
(All dataphor keywords are in bold)

create table TESTC
 {
 EMPLOYEEID : Integer,
 LOCATIONID : String,
 DEPTID : String { default 'ONE' },
 VPNTOKEN : Integer { nil } ,
 key {EMPLOYEEID}
  }

The primary key is EMPLOYEEID. The 'Integer { nil }' is the Dataphor
way of saying that VPNTOKEN can be null (no value) as the data
is stored in sql server.
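
Roughly the shape of the table that ends up on the SQL Server back end
(a sketch only; the exact names and type mappings Dataphor emits may differ):

-- Approximate back end storage table (a sketch, not Dataphor's generated DDL;
-- the varchar lengths here are guesses).
create table TESTC
(
    EMPLOYEEID int not null primary key,
    LOCATIONID varchar(200) not null,
    DEPTID     varchar(200) not null default 'ONE',
    VPNTOKEN   int null
)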

A table constraint can now be declared that captures the logic of
uniqueness for non-null VPNTOKENs while allowing unlimited nulls.
The constraint can be declared using the Dataphor alter table.

alter table TESTC
 {
  create constraint DISTINCTVPNWITHNULLS
   exists(
          TESTC
           group add {Count(VPNTOKEN) CNT,Count(distinct VPNTOKEN) DISTVPNCNT}
             where CNT=DISTVPNCNT
          )
     tags
 {
     DAE.Message =
      "'VPNTOKEN ' + ToString(VPNTOKEN) + ' is not unique'"
 }   
  }      

The 'Count(VPNTOKEN)' and 'Count(distinct VPNTOKEN)' work the same as
they do in sql. Both count constructs eliminate null values. So the
where predicate compares the count of non-null values to the distinct
count of non-null values. The group statement has no by clause, so
the comparison is over the whole table. The logic of the constraint
says that if the where is true then all the non-null values must be
unique. If the where is false, an attempt was made to enter an existing
non-null VPNTOKEN, which violates the uniqueness of VPNTOKEN in the
table and thus the constraint. The attempt to enter such a VPNTOKEN
will be rejected. Since the where is independent of nulls, all null
values for VPNTOKEN will satisfy the constraint (it evaluates to true)
and will be accepted. The only way a row with a null VPNTOKEN will be
rejected is if it violates the uniqueness of the key of the table,
which is EMPLOYEEID. This is the meaning of specifying a key. Only
unique values of EMPLOYEEID are acceptable.
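
If you want to see the same test spelled out directly against the back
end table, here is a hedged T-SQL sketch of the check the constraint
performs (not what Dataphor actually executes):

-- The constraint holds when the count of non-null tokens equals the count
-- of distinct non-null tokens, i.e. no non-null value appears twice.
-- Nulls affect neither count.
select case when count(VPNTOKEN) = count(distinct VPNTOKEN)
            then 'constraint satisfied' else 'constraint violated' end
from TESTC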

The tag keyword indicates metadata is supplied with the constraint
definition. Dataphor will use the message when the constraint
is violated as opposed to a cryptic message returned by sql server
when a key/index is violated.

Let's insert some data with unique VPNTOKENs

insert
     table
      {
        row { 1 EMPLOYEEID,'A' LOCATIONID,'ONE' DEPTID, 1 VPNTOKEN },
        row { 2,'A','ONE',2},
        row { 3,'B','TWO',3}
      }
        into TESTC;
      
Insert some rows with null VPNTOKEN.

insert
     table
      {
        row { 4 EMPLOYEEID,'C' LOCATIONID,'ONE' DEPTID},
        row { 5 EMPLOYEEID,'C' LOCATIONID,'THREE' DEPTID}
      }
       into TESTC          
     
This is the data in the table.

select TESTC

EMPLOYEEID LOCATIONID DEPTID VPNTOKEN 
---------- ---------- ------ ----------
1          A          ONE    1        
2          A          ONE    2        
3          B          TWO    3        
4          C          ONE    <No Value>
5          C          THREE  <No Value>

Now in the derived form that dataphor has available for every table
and view we can try to enter a row such as:

 row { 7 EMPLOYEEID,'C' LOCATIONID,'THREE' DEPTID,2 VPNTOKEN}

Dataphor uses the tag metadata expression to clearly show the user
the problem with the specific data that violated the constraint:



I submit this is a significant improvement over the type of message
returned by sql server, which tells you what object (i.e. key)
caused the violation but gives no information about the data
which caused it. The developer is left with this task on the frontend.

Any table constraint should be a short, straightforward process to
express. As another example, this constraint allows unique VPNTOKENs
but at most 1 null value.

alter table TESTC
 {
  create constraint DISTINCTVPNWITHONENULL
   exists(
           TESTC
            group add {
                       Count() TOTALCNT,
                       Count(VPNTOKEN) CNT,
                       Count(distinct VPNTOKEN) DISTVPNCNT
                      }
              where 
                (TOTALCNT-CNT < 2) and (CNT=DISTVPNCNT)
          )
     tags
 {
             DAE.Message =
              "'VPNTOKEN ' + IfNil(ToString(VPNTOKEN), 'NULL') + ' is not unique'"
 }   
  }   

The where clause uses '(TOTALCNT-CNT < 2)' to check that there is
at most 1 null value (there can be 0 or 1) and uses '(CNT=DISTVPNCNT)'
to ensure that the non-null values are unique. With a violation, the
message will display a VPNTOKEN value that already exists in the table,
including 'NULL'.
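
And the at-most-one-null variant as a direct T-SQL check (again just a
sketch of the logic, not what Dataphor executes):

-- count(*) - count(VPNTOKEN) is the number of null tokens: it must be 0 or 1,
-- and the non-null tokens must still be distinct.
select case when count(*) - count(VPNTOKEN) < 2
             and count(VPNTOKEN) = count(distinct VPNTOKEN)
            then 'constraint satisfied' else 'constraint violated' end
from TESTC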

Dataphor by default will display an informative message for any key
that is violated. It is not necessary to explicitly state any
message. On sql server a primary key violation displays only the
object of the violation. For example:

'Cannot insert duplicate key row in object 'dbo.testc' with unique
index 'UIDX_TESTC_EMPLOYEEID'.
The statement has been terminated.'

While dataphor will display the specific data that caused the
violation:



In the future I'll talk about other types of constraints in
dataphor such as column and transition row constraints.

Bye for now.

Tuesday, September 05, 2006

RAC - Rank This!

Sql Server 2005 Ranking functions vs. Rac Ranking

Principal players:
Rac v2.33 running on  (more info on Rac here)
MS Sql Server 2005 sp1

In general any abstraction for t-sql is a good abstraction. And
the Sql-99 analytic ranking functions are indeed a welcome
addition to Sql Server 2005. There has been universal approval,
and all kinds of favor have been accorded them by the Sql Server
community. But the history of sql is the history of the gang
that can't shoot straight. And the gang tends to always be a
day late and a dollar short. So did sql and MS show themselves to
be all they can be with the new ranking functionality? Is the
Sql Server model superior or inferior to well..the 'RAC' model
of ranking? Two logically different approaches. Think of this
as Fox Sql, fair and balanced..you decide :)

Here is the general specification of a rank construct from BOL:

RANK_TYPE ( ) OVER ( [ < partition_by_clause > ] < order_by_clause > )

Excuse me if I ask just who thinks this stuff up? Perhaps they
should get off what they're on or get back on what they're off. It's
painful for those not steeped in sql speak and clearly doesn't
capture the logical/conceptual meaning of the construct. How many
of you first looked at this spec and wondered just what is the basis
for the rank in the first place? So allow me to express it based on
common sense:

RANK_TYPE (For Column(s)/[ASC | DESC]) OVER TABLE <Optional:[PARTITION BY Column(s)]>

Res ipsa loquitur - Now the thing speaks for itself.

The rank is based on a column. To be fair, this is what the sql
standard meant when it expressed the idea of the 'utility'
of a construct to express the ageless sql of a rank in a
subquery:

(select count(*) 
       from t1
         where t1.column<=t2.column) as Rank
         
The meaning of the ordering of the column and whether or not the
rank is obtained within groups/partitions or over the entire table
is clear.

Now here is the concept of a rank in Rac expressed in a similar way:

RANK_TYPE (For PARTITION Columns) OVER TABLE Optional ORDER BY Columns

Res ipsa loquitur - Now the thing speaks for itself.

Here the object of a rank is a partition! The sort order of the partition
is the column(s) of the partition, but an optional sort order can be
defined with columns independent of the partition column(s). This is
a logically different concept than the sql rank. That an sql rank thru
a column(s) cannot encapsulate a rank predicated on partition(s), or
a structure if you will, is because there is no guarantee that a column(s)
exists to capture or enumerate such a rank. In other words a single
subquery, the sql model, may not be adequate. The underlying sql code
necessary to capture the rank may take the form of nested subqueries,
certainly more complicated than the sql rank single subquery.

The underlying Rac model for capturing ranks over partitions can be
seen here where I first proposed creating 'virtual groups':
    microsoft.public.sqlserver.programming 
    Mon, Jul 5 1999 
    'Difficult SQL Question'
    Name: 'Trysql' (but signed by yours truly -:)  
    The thread can be seen here. 
    
Of course a partition can be conceptually decomposed to simple column(s)
so the sql rank is encapsulated within the logical model of the Rac rank.

Here is an example that elucidates the two different ranking models.
(The data is extrapolated from an example of a Rac solution which can be seen here).

create table ranktest (ID tinyint primary key,ROWID tinyint,TYPE tinyint,POS tinyint,STRING char(1))
insert ranktest values(1,1,1,1,'1')
insert ranktest values(2,1,1,2,'2')
insert ranktest values(3,1,2,3,'-')
insert ranktest values(4,1,2,4,'%')
insert ranktest values(5,1,2,5,'=')
insert ranktest values(6,1,2,6,'B ')
insert ranktest values(7,1,1,7,'3')
insert ranktest values(8,1,1,8,'1')
insert ranktest values(9,1,1,9,'2')
insert ranktest values(10,1,2,10,'A')
insert ranktest values(11,2,2,1,'D')
insert ranktest values(12,2,2,2,'C')
insert ranktest values(13,2,2,3,'M')
insert ranktest values(14,2,1,4,'6')
insert ranktest values(15,2,1,5,'5')
insert ranktest values(16,2,2,6,'T')
insert ranktest values(17,2,2,7,'Y')
insert ranktest values(18,2,2,8,'W')
insert ranktest values(19,2,2,9,'R')
insert ranktest values(20,2,1,10,'6')
insert ranktest values(21,2,1,11,'7')

select * from ranktest

ID   ROWID TYPE POS  STRING 
---- ----- ---- ---- ------ 
1    1     1    1    1
2    1     1    2    2
3    1     2    3    -
4    1     2    4    %
5    1     2    5    =
6    1     2    6    B
7    1     1    7    3
8    1     1    8    1
9    1     1    9    2
10   1     2    10   A
11   2     2    1    D
12   2     2    2    C
13   2     2    3    M
14   2     1    4    6
15   2     1    5    5
16   2     2    6    T
17   2     2    7    Y
18   2     2    8    W
19   2     2    9    R
20   2     1    10   6
21   2     1    11   7

We want a dense rank for the partition ROWID,TYPE. Although ID
and POS are redundant, they give a meaningful order to the rows,
both over the table and for TYPE within ROWID.

The Rac model can be represented conceptually as:

DENSE_RANK (For PARTITION {ROWID,TYPE}) OVER TABLE (ORDER BY ID)

or

DENSE_RANK (For PARTITION {ROWID,TYPE}) OVER TABLE (ORDER BY ROWID,POS)

Rac does not offer the autonomy of ranking specifications as in
sql. Partitions are defined by the ordinal position of columns
in @rows. In the absence of optional sorting of the rows using
@rowsort, the partition is ordered by the @rows specification
from left to right. With @rowsort the partition is ordered by the
@rowsort column(s). The Rac ranking parameter @rowindicators returns
a dense rank.

The key parameters to obtain the dense rank are:

@rows='ROWID & TYPE & POS & STRING',
@rowsort='ID' or @rowsort='ROWID & POS'
@rowindicators='TYPE{DENSE_RANK}'

The @rows specification implies the needed partition.
The 'TYPE' in the @rowindicators argument 'TYPE{DENSE_RANK}'
is the end point of the partition to obtain the dense rank for.
In other words, all partitions as defined in @rows will be included
for this dense rank up to and including TYPE. Therefore 'TYPE'
implies the inclusion of 'ROWID', and the partition 'ROWID,TYPE'
will be used for the rank. Any number of ranks can be defined
in a Rac execute, but they will all be based on the partition implied
in @rows from left to right. A rank based on row_number() is
included via @rowcounters and uses the partition 'ROWID,TYPE'.
The name of the dense rank is given within the braces, i.e. '{DENSE_RANK}'.
The name of the row_number rank is 'ROW_NUMBER'.

Rac executes:

Exec Rac
@transform='_dummy_',
@rows='ROWID & TYPE & POS & STRING',
-- Order the partitions (@rows).
@rowsort='ID',
@pvtcol='Report Mode',@defaultexceptions='dumy',
@from='ranktest',@defaults1='y',@rowbreak='n',@racheck='y',
@counterdatatype='tinyint',
-- DENSE_RANK() using PARTITION BY ROWID,TYPE ORDERED BY ID
@rowindicators='TYPE{DENSE_RANK}',
-- ROW_NUMBER() using PARTITION BY ROWID,TYPE ORDERED BY ID
@rowcounters='TYPE{ROW_NUMBER}'

OR

Exec Rac
@transform='_dummy_',
@rows='ROWID & TYPE & POS & STRING',
-- Order the partitions (@rows).
@rowsort='ROWID & POS',
@pvtcol='Report Mode',@defaultexceptions='dumy',
@from='ranktest',@defaults1='y',@rowbreak='n',@racheck='y',
@counterdatatype='tinyint',
@rowindicators='TYPE{DENSE_RANK}',
@rowcounters='TYPE{ROW_NUMBER}'

Result

ROWID  TYPE POS  STRING DENSE_RANK ROW_NUMBER 
------ ---- ---- ------ ---------- ---------- 
1      1    1    1      1          1
1      1    2    2      1          2
1      2    3    -      2          1
1      2    4    %      2          2
1      2    5    =      2          3
1      2    6    B      2          4
1      1    7    3      3          1
1      1    8    1      3          2
1      1    9    2      3          3
1      2    10   A      4          1
2      2    1    D      1          1
2      2    2    C      1          2
2      2    3    M      1          3
2      1    4    6      2          1
2      1    5    5      2          2
2      2    6    T      3          1
2      2    7    Y      3          2
2      2    8    W      3          3
2      2    9    R      3          4
2      1    10   6      4          1
2      1    11   7      4          2

With the sql rank the question becomes: is there a column(s)
that encapsulates the bi-directional nature of TYPE within ROWID?
Sadly there is not. The only column appropriate for the sql rank
is the DENSE_RANK itself! In other words, the sql rank cannot
be stated:

DENSE_RANK() OVER (PARTITION BY ROWID,TYPE ORDER BY ?) as DENSE_RANK 

or is redundant and a non-solution:

DENSE_RANK() OVER (PARTITION BY ROWID ORDER BY TYPE) as DENSE_RANK 

Any of the sql ranking functions, no matter how they are formulated,
will not reflect the meaning of the data. This is especially seen
by comparing the sql ROW_NUMBER() with the one from Rac:

select ID,ROWID,TYPE,POS,STRING,
ROW_NUMBER() OVER (PARTITION BY ROWID order by TYPE) as [TYPE ROW_NUMBER],
ROW_NUMBER() OVER (PARTITION BY ROWID,TYPE order by POS) as [POS ROW_NUMBER] 
from ranktest
order by ID

ID   ROWID TYPE POS  STRING TYPE ROW_NUMBER POS ROW_NUMBER RAC ROW_NUMBER 
---- ----- ---- ---- ------ --------------- -------------- -------------- 
1    1     1    1    1      1               1              1
2    1     1    2    2      2               2              2
3    1     2    3    -      7               1              1
4    1     2    4    %      8               2              2
5    1     2    5    =      9               3              3
6    1     2    6    B      10              4              4
7    1     1    7    3      3               3              1
8    1     1    8    1      4               4              2
9    1     1    9    2      5               5              3
10   1     2    10   A      6               5              1
11   2     2    1    D      9               1              1
12   2     2    2    C      10              2              2
13   2     2    3    M      11              3              3
14   2     1    4    6      1               1              1
15   2     1    5    5      2               2              2
16   2     2    6    T      5               4              1
17   2     2    7    Y      6               5              2
18   2     2    8    W      7               6              3
19   2     2    9    R      8               7              4
20   2     1    10   6      3               3              1
21   2     1    11   7      4               4              2

Again the ranks do not reflect the meaning of the data when the
rank is dependent on a column instead of a logical partition.
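
To make the earlier point about nested formulations concrete, here is one
hedged T-SQL 2005 sketch (the standard 'islands' technique, not anything the
ranking functions give you directly and not how Rac works) that does
reproduce the partition-based DENSE_RANK and ROW_NUMBER over ranktest:

-- A sketch only: group consecutive runs of TYPE within ROWID, then rank the runs.
with G as
(
    select *,
           row_number() over (partition by ROWID order by ID)
         - row_number() over (partition by ROWID, TYPE order by ID) as grp
    from ranktest
),
I as
(
    select *,
           min(ID) over (partition by ROWID, TYPE, grp) as island_start
    from G
)
select ROWID, TYPE, POS, STRING,
       dense_rank() over (partition by ROWID order by island_start)  as [DENSE_RANK],
       row_number() over (partition by ROWID, TYPE, grp order by ID) as [ROW_NUMBER]
from I
order by ROWID, ID

It gets the same numbers, but only by stacking two layers of row_number()
arithmetic and a grouping key on top of each other, which is exactly the
kind of complication the single ranking call was supposed to spare us.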

While the sql rank is indeed a welcome abstraction, they could
have done a much better job. By using such a low level of
abstraction, many clear and easy solutions to non-trivial problems
remain lost to most of the user community. This is unfortunate.
While there is a certain quality in quantity, sql never seems to
truly show that the sum is greater than its parts. And for those
who build database systems, as well as those that use them and
who frown on abstraction, the burden is on them to tell the rest
of us why. Less can be very much more.



Obtaining sql ranks with Rac.

This is intended to show how Rac obtains equivalent sql ranks.

This example uses the AdventureWorks query in BOL shown under
'Ranking Functions (Transact-SQL)'
ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/tsqlref9/html/e7f917ba-bf4a-4fe0-b342-a91bcf88a71b.htm

An additional rank for ROW_NUMBER within PostalCode is added.

SELECT c.FirstName, c.LastName
    ,ROW_NUMBER() OVER (ORDER BY a.PostalCode) AS 'Row Number'
--Rank for LastName,FirstName Partitioned by Postcode    
    ,ROW_NUMBER() OVER (PARTITION BY a.PostalCode ORDER BY c.LastName,c.FirstName) AS 'PART. Row Number'
    ,RANK() OVER (ORDER BY a.PostalCode) AS 'Rank'
    ,DENSE_RANK() OVER (ORDER BY a.PostalCode) AS 'Dense Rank'
    ,NTILE(4) OVER (ORDER BY a.PostalCode) AS 'Quartile'
    ,s.SalesYTD, a.PostalCode
FROM Sales.SalesPerson s 
    INNER JOIN Person.Contact c 
        ON s.SalesPersonID = c.ContactID
    INNER JOIN Person.Address a 
        ON a.AddressID = c.ContactID
WHERE TerritoryID IS NOT NULL 
    AND SalesYTD <> 0;
    
FirstName  LastName   Row Number PART. Row Number Rank Dense Rank SalesYTD    PostalCode      
---------- ---------- ---------- ---------------- ---- ---------- ----------- --------------- 
Maciej     Dusza      1          1                1    1          4557045.05  98027
Shelley    Dyck       2          2                1    1          5200475.23  98027
Linda      Ecoffey    3          3                1    1          3857163.63  98027
Carla      Eldridge   4          4                1    1          1764938.99  98027
Carol      Elliott    5          5                1    1          2811012.72  98027
Jauna      Elson      6          1                6    2          3018725.49  98055
Michael    Emanuel    7          2                6    2          3189356.25  98055
Terry      Eminhizer  8          3                6    2          3587378.43  98055
Gail       Erickson   9          4                6    2          5015682.38  98055
Mark       Erickson   10         5                6    2          3827950.24  98055
Martha     Espinoza   11         6                6    2          1931620.18  98055
Janeth     Esteves    12         7                6    2          2241204.04  98055
Twanna     Evans      13         8                6    2          1758385.93  98055    


Sql Rank                                   Rac Rank
ROW_NUMBER() OVER (ORDER BY a.PostalCode)  Rac by default computes a ROW_NUMBER over the
                                           table in the sort order implied by @rows or
                                           explicitly specified in @rowsort. It is 
                                           referenced by @tablecounter if there is no
                                           @select or by rd in a select statement.

RANK() OVER (ORDER BY a.PostalCode)        @rowruns='dumy^rank() within tablecounter^(dumy){PostalCode}'
                                           Rac does not have a default for RANK() as accounting
                                           for ties is seldomed used or needed. But it can be obtained 
                                           easily using a form of the Rac running sums parameter
                                           @rowruns. 
DENSE_RANK() OVER (ORDER BY a.PostalCode)  PostalCode{[Dense Rank]}
                                           A trivial case of a column (sql) and a partition (Rac)
                                           having the same semantics.
NTILE(4) OVER (ORDER BY a.PostalCode)      Rac has no direct equivalent expression. It can be computed
                                           using a combination of Rac options but it seems seldom
                                           called for.

The sql rank:

ROW_NUMBER() OVER (PARTITION BY a.PostalCode ORDER BY c.LastName,c.FirstName)


is equivalent to the Rac @rowcounters='PostalCode{[PART. Row Number]}'. The implied
sort order in @rows, PostalCode,LastName,FirstName, gives the Rac @rowcounters
specification the same semantics as this ROW_NUMBER.

Regardless of the order columns are specified in @rows, the data can
be presented/manipulated at will via the @select parameter.

Here is the Rac version which returns the same result as the adventureworks query.

Exec Rac
@transform='_dummy_',
@rows='a.PostalCode & c.LastName & c.FirstName & s.SalesYTD',
@pvtcol='Report Mode',
@from='Sales.SalesPerson s 
    INNER JOIN Person.Contact c 
        ON s.SalesPersonID = c.ContactID
    INNER JOIN Person.Address a 
        ON a.AddressID = c.ContactID',
@where='TerritoryID IS NOT NULL AND SalesYTD <> 0',
@rowbreak='n',@defaults1='y',@racheck='y',
@tablecounter='y',
@rowindicators='PostalCode{[Dense Rank]}',
@rowcounters='PostalCode{[PART. Row Number]}',
@rowruns='dumy^rank() within tablecounter^(dumy){PostalCode}',
@select='select FirstName,LastName,rd as [Row Number],[PART. Row Number],
            1*runs as RANK,[Dense Rank],SalesYTD,PostalCode 
               from rac
                 order by rd'

Saturday, September 02, 2006

RAC - Are you Coordinated?

Is not the concept of a 'coordinate' a clear and powerful
idea? The entire world is mapped by the coordinates of
latitude and longitude. You can pinpoint any place in the
world given its coordinates. A primary key on a table is
a coordinate. The key identifies precisely the row in
the table. If you can create the coordinates of any
information you have created the ability to precisely
identify anything about it. Especially when the information
is data.

A frequent problem is finding a particular occurrence of a
'some thing' in a string. The 'some thing' is frequently
a number. The problem becomes finding the Nth occurrence
of a number, any number. For example, given the string:

'ASF--124 RW!==RXT34146$==PFS773044Y -JPFK*5435DAD '

extract the 3rd occurrence of a number, i.e. '773044'.

Often the solution to this is to build a forest of string
functions. But it is too easy to get lost in such a
forest:) But, if we can map the string with coordinates
the problem and solution become clear, precise and
simple. So how do we create coordinates and what
form do they take? Well we can transpose the string to
rows and create a set of coordinates for each row, 
ie. each character. The rows/characters correspond to
the string as we read it from left to right. Once the
string is transposed we'll create 2 coordinates. The
first one is simply an indicator for whether or not
the character is a digit. This makes sense since
conceptually we only care if a character is a digit
(part of a number) or is a non-digit, not part of
a number. We're not concerned about individual
characteristics of any character, right?:) Now the
2nd coordinate, well that's going to be a bit of magic.
And this bit of magic is going to take the form of
a 'rank'. I'm now going to put RAC to use to develop
a very particular but easy rank.

I'll be using:
MS Sql Server 2005 sp1
RAC v2.34 beta
For details on RAC go here.

First some test data:

create table RT (rowid int primary key,messystr varchar(30))
insert RT values(1,'12-%= 312 ')
insert RT values(2,'YTW$152&OP8393&K734680UF104')
insert RT values(3,'624/!/53RF274UT56')
insert RT values(4,'342')
insert RT values(5,'ORDER')
insert RT values(6,'\\8AApEFS30902$1')
insert RT values(7,'4#7/&91A56G7VW37@&&&&')

select * from RT

rowid       messystr                       
----------- ------------------------------ 
1           12-%= 312 
2           YTW$152&OP8393&K734680UF104
3           624/!/53RF274UT56
4           342
5           ORDER
6           \\8AApEFS30902$1
7           4#7/&91A56G7VW37@&&&&

We're going to start by splitting each string character
by character per rowid. We're also going to create the
first coordinate, a new column, that's 1 if the character
is a digit and 2 for everything else. This is column 'type'.
Column pos is the position (from left to right) of the
character ('str1') in 'messystr'. Column 'rd' is generated
by Rac and is an ascending integer over the whole table
that follows the sort order by 'rowid' and 'pos'. The
result is saved in table ##t1.

Exec Rac
@split='[position]',
@rows='rowid & [position]', 
@pvtcol='messystr',@rank='str',
@splitby='1', -- Splits a string by N consecutive positions.
@from='RT',@defaults1='y',@rowbreak='n',@racheck='y',
@select='select rd,1*rowid as rowid,1*position as pos,str1,
         case when ascii(str1) between 48 and 57 then 1 else 2 end as type
         into ##t1
           from rac
             order by rd'

Here is the result for 'rowids' 1 and 6.

select * from ##t1 where rowid in (1,6)

rd          rowid       pos         str1 type        
----------- ----------- ----------- ---- ----------- 
1           1           1           1    1
2           1           2           2    1
3           1           3           -    2
4           1           4           %    2
5           1           5           =    2
6           1           6                2
7           1           7           3    1
8           1           8           1    1
9           1           9           2    1
10          1           10               2
63          6           1           \    2
64          6           2           \    2
65          6           3           8    1
66          6           4           A    2
67          6           5           A    2
68          6           6           p    2
69          6           7           E    2
70          6           8           F    2
71          6           9           S    2
72          6           10          3    1
73          6           11          0    1
74          6           12          9    1
75          6           13          0    1
76          6           14          2    1
77          6           15          $    2
78          6           16          1    1

Now for our 2nd coordinate, a rank. The 'type' varies
within 'rowid' ordered by 'rd'. We want a rank that
increments only when 'type' changes within 'rowid'
and consequently stays the same within 'type'. Within
a 'rowid' the rank can range from 1 to N. When the
'rowid' changes the rank resets to 1. This kind of
rank is obtained using the Rac @rowindicators parameter.

Exec Rac
@transform='_dummy_',
@rows='rowid & type & pos & str1',
@rowsort='rd',
@pvtcol='Sql*Plus',@defaultexceptions='dumy',
@from='##t1',@defaults1='y',@rowbreak='n',@racheck='y',
@rowindicators='type{typerank}',@counterdatatype='int'

Here is the result for rowid's 1 and 6.

rowid  type pos  str1 typerank    
------ ---- ---- ---- ----------- 
1      1    1    1    1   1st occurrence @1
1      1    2    2    1   1st occurrence @1
1      2    3    -    2
1      2    4    %    2
1      2    5    =    2
1      2    6         2
1      1    7    3    3
1      1    8    1    3
1      1    9    2    3
1      2    10        4
6      2    1    \    1
6      2    2    \    1
6      1    3    8    2    1st occurrence @2
6      2    4    A    3
6      2    5    A    3
6      2    6    p    3
6      2    7    E    3
6      2    8    F    3
6      2    9    S    3
6      1    10   3    4
6      1    11   0    4
6      1    12   9    4
6      1    13   0    4
6      1    14   2    4
6      2    15   $    5
6      1    16   1    6
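
For comparison, here is a hedged T-SQL 2005 sketch of what that
@rowindicators rank computes, using the same 'islands' trick shown
earlier against ##t1 (a sketch only, not how Rac does it internally):

-- A rank that increments whenever type changes within rowid (ordered by rd)
-- and resets to 1 on a new rowid.
with G as
(
    select *,
           row_number() over (partition by rowid order by rd)
         - row_number() over (partition by rowid, type order by rd) as grp
    from ##t1
),
I as
(
    select *,
           min(rd) over (partition by rowid, type, grp) as island_start
    from G
)
select rowid, type, pos, str1,
       dense_rank() over (partition by rowid order by island_start) as typerank
from I
order by rd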

With 'type' and 'typerank' we now have the coordinates
of any occurrence of a number. If you examine the
table closely you should be able to come up with the
formulation of it:) Ok here it is:

The Nth occurrence of a number has coordinates:

'type'=1 and

'typerank' in ((2*N)-1 , 2*N)
 
1st occurrence
(2*1)-1 , 2*1 : in (1,2)
 
2nd occurrence
(2*2)-1 , 2*2 : in (3,4)
 
3rd occurrence
(2*3)-1 , 2*3 : in (5,6)
 
4th occurrence
(2*4)-1 , 2*4 : in (7,8)

Note that 'typerank' is an 'or' not an 'and'. You are
guaranteed to have a number in one but not both:)
For example, for the 1st occurrence the number will
have a 'typerank' of 1 if it is not preceded by
non-digits (see 'rowid' 1 above). If non-digits come
first, the 1st occurrence is at 'typerank' 2 (see 'rowid'
6 above). The same logic applies to any Nth occurrence.

So now we can put a variable filter in RAC that accepts
the computed coordinates for 'typerank' for any particular
occurrence we want/pass to it (a form of macro substitution
for stored procedure parameter arguments). The filter
is contained in the @wherecounters parameter.

For example, we'll get the rows for the 2nd occurrence
of a number.

declare @N int,@N1 int,@N2 int
-- Only the value of @N is needed. 
set @N=2
set @N1=(@N*2)-1
set @N2=@N*2
--
Exec Rac
@transform='_dummy_',@user1=@N1,@user2=@N2, 
@rows='rowid & type & pos & str1',
@rowsort='rd',
@pvtcol='Sql*Plus',@defaultexceptions='dumy',
@from='##t1',@defaults1='y',@rowbreak='n',@racheck='y',
@rowindicators='type{typerank}',@counterdatatype='int',
--
-- Use coordinates to filter rows. @N1 and @N2
-- will be substituted for @user1 and @user2.
--
@wherecounters='type=1 and typerank in(@user1,@user2)',
@select='select rd,1*rowid as rowid,1*pos as pos,str1
         into ##t2
         from rac
          order by rd'

select * from ##t2

rd          rowid       pos         str1 
----------- ----------- ----------- ---- 
7           1           7           3
8           1           8           1
9           1           9           2
21          2           11          8
22          2           12          3
23          2           13          9
24          2           14          3
44          3           7           5
45          3           8           3
72          6           10          3
73          6           11          0
74          6           12          9
75          6           13          0
76          6           14          2
81          7           3           7

Now we only have to concatenate the rows within 'rowid'
to return the number. This is very simple with Rac.

Exec Rac
@transform='Max(str1) as str',
@rows='rowid',
@pvtcol='pos',
@from='##t2',
@defaults1='y',@racheck='y',
@concatenate='str',@stringname='number',
@separator='',@cutpvt='y'

rowid  number 
------ ------ 
1      312
2      8393
3      53
6      30902
7      7

And there you have it. Of course the whole process can 
be streamlined.
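
If you want that last step in plain t-sql for comparison, here is a
hedged sketch using the usual FOR XML PATH('') concatenation against
##t2 (not how Rac does it):

-- Re-assemble the extracted digits per rowid, in position order.
select t.rowid,
       (select '' + str1
          from ##t2 as x
         where x.rowid = t.rowid
         order by x.pos
           for xml path('')) as number
from ##t2 as t
group by t.rowid
order by t.rowid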

So are you going to get coordinated or are you going to
continue to go around in circles? :)


To paraphrase Gordon Gekko from 'Wall Street':
'Abstraction is good, it clarifies and purifies...'
It allows you to solve a problem, an end, instead of
solving a means and then an end. There is something
to be said about qualitatively different solutions.
Precisely what, I am unsure, but I think it is something
very good.

We need more abstraction in data management. That is
why I am an advocate of Dataphor and the D4 language.
That Rac doesn't have a lot of company is not a good
thing, it's a bad thing. That a company like MS has
a vested interest in not fostering abstraction is a
bad thing. The industry continues to flounder in its
own sea of code and contradictions. That is not a good thing.

Find your way to show another way.

(Added 12/20/2008
 Note to see a much easier solution to this problem using Dataphor see:
'Extracting numbers from a string with Dataphor')

Friday, September 01, 2006

Dataphor - Get LiT!

LiT: = lists into tables:)

I'm going to talk about a bunch of kewl stuff as it
relates to LiT. So get relaxed and come take a ride on
the D(4) train:)

As I showed here the D4 language makes it very easy to split
strings. Now I want to show you some more kewl stuff about
working with strings, lists and tables using the Split operator.

Let's start with this string delimited by the pipe ('|') character:

'FF|BB | HH HH | D |AA'

Notice that blanks can appear at the beginning, end and within a
string part.

The Split operator takes a string and splits it into a 'list' of
strings based on one or more delimiters. Conceptually this means
that the above string is transformed into:

'FF','BB ',' HH HH ',' D ','AA'

in other words a comma delimited list of strings. And with a list
you can access and operate on the individual string parts. The
concept of a list is conceptually close to what concept in a database?
Yep, you guessed it - a table! Each string in the list fits nicely
with the idea of a row in a table. Make sense? And once the list is
in a table, well we can do just about anything with it as you'll see:)

So lets create a table from our example string using Split.

The basic construction of Split using the method style is:

'somestring'.Split({delimiters})

With our example string, which uses the pipe as a delimiter, we have:

'FF|BB | HH HH | D |AA'.Split({'|'})

The Split operator uses a thingie called an 'indexer' to access the
individual parts of the list, like this:

'some string'.Split({delimiters})[index]

where index is a digit for the desired element of the list. So if we
want the 1st string/element in the list we use:

'FF|BB | HH HH | D |AA'.Split({'|'})[0]

We're using 0 because the indexer is 0 based, ie. 1st element 0, 2nd
element 1 etc. Real easy. Ok we've got enough info now to select
the 1st string:

select 'FF|BB | HH HH | D |AA'.Split({'|'})[0]

FF

Ok for the moment lets leave the idea of getting individual parts
of the list with the indexer and think of getting 'the whole list'
with one swipe of the magical D4 wand. And that wand is the D4
ToTable. Since Split has done the heavy lifting of creating the
list, all we have to do to get LiT  is this: 

select ToTable( 'FF|BB | HH HH | D |AA'.Split({'|'}))

Value  
-------
 D     
 HH HH 
AA     
BB     
FF  

and we are Lit (list in table) :)

When you transform a list to a table the column is by default named 'Value'.
You can name the column anything you want by supplying a column
name to ToTable. I'll call the column 'STRING':

select ToTable( 'FF|BB | HH HH | D |AA'.Split({'|'}),'STRING')

STRING 
-------
 D     
 HH HH 
AA     
BB     
FF     

Ok we got the list in a table but you see that the strings have the
same blanks as they had originally. Since we are LiT we can trim the
leading and trailing blanks of each row (string) using :

{Trim(STRING) STRING}

This means trim each row and retain the same column name.

select ToTable ( 'FF|BB | HH HH | D |AA'.Split({'|'}),'STRING' )
                                         {Trim(STRING) STRING}

STRING
------
AA    
BB    
D     
FF    
HH HH 

Ok at this point your head is probably ready to explode since
you can see that while I've made all this fuss about LiT,
the rows of the table do not match the order the strings
should be in! In other words, if Split creates a list shouldn't
the table reflect the strings like:

STRING
------
FF
BB
HH HH
D
AA

If this has occurred to you, I'm doing my job:) As I said above
we can use what's called an indexer to address the elements
of the list in order. And we'll do that soon. But it's
meaningful to get LiT even though the elements are not in
the proper sequence.

Just understand that without supplying specific information
to Split about the individual elements (the indexer thingie [])
we can only get a table whose rows are ordered by the string
elements themselves (the key), not by their position in the
original string.

Here we add a system generated auto-incrementing column INDEX
using the GetNextGenerator operator that reflects the ascending
order of STRING. (The INDEX column is similar to the identity
function used in MS Sql Server). You can see that INDEX reflects
the original (untrimmed) strings.

select (ToTable ( 'FF|BB | HH HH | D |AA'.Split({'|'}),'STRING' ))
        add {GetNextGenerator('INDEX') INDEX,Trim(STRING) STRING1}
                                                   
STRING  INDEX STRING1
------- ----- -------
 D      1     D      
 HH HH  2     HH HH  
AA      3     AA     
BB      4     BB     
FF      5     FF     


Here we get the trimmed strings and then add the auto-incrementing column.

select (
        (ToTable ( 'FF|BB | HH HH | D |AA'.Split({'|'}),'STRING' ))
                         {Trim(STRING) STRING}
       )
         add { GetNextGenerator('INDEX') INDEX }
 
STRING INDEX
------ -----
AA     1    
BB     2    
D      3    
FF     4    
HH HH  5    

The INDEX now reflects the ascending order of the trimmed
strings. No matter how you cut it, splitting a string
in toto into a table is not going to give you the order
in which the elements appear in the original string. As I
said, even though this is true it still has value and
significance. Now I'm going to show you why.

Here are 2 tables created on MS Sql Server 2005. We're
going to access them with Dataphor using D4.

create table str1 (ID1 int primary key,STRING1 varchar(50))
insert str1 values(1,'AA|BB| CC')
insert str1 values(2,'FF|GG|HH MM| J| KK')
insert str1 values(3,'SS|TT|UU|VV')
insert str1 values(4,'AA|YYYY|PQ|QRDD')
insert str1 values(5,'D10|B20|A30')

create table str2 (ID2 int primary key,STRING2 varchar(50))
insert str2 values(1,'AA | BB| CC')
insert str2 values(2,'FF |GG |HH MM | J | KK')
insert str2 values(3,'SS|TT|UU|XX')
insert str2 values(4,'AA|PPPP|WR|QRDD')
insert str2 values(5,'D10|B20')

Using D4:

select str1

ID1 STRING1           
--- ------------------
1   AA|BB| CC
2   FF|GG|HH MM| J| KK
3   SS|TT|UU|VV
4   AA|YYYY|PQ|QRDD
5   D10|B20|A30

select str2

ID2 STRING2               
--- ----------------------
1   AA | BB| CC
2   FF |GG |HH MM | J | KK
3   SS|TT|UU|XX
4   AA|PPPP|WR|QRDD
5   D10|B20

Let's suppose we want a query that compares STRING1 and
STRING2 by ID as lists. We want the ID's where the lists
are the same. In other words, we want the ID's where all
elements in STRING1 and STRING2 are the same (equal). So
we're looking for ID's 1 and 2, where the trimmed elements
are the same.

You'll recall that we LiT a string using:

select ToTable ( 'FF|BB | HH HH | D |AA'.Split({'|'}),'STRING' )
                                         {Trim(STRING) STRING}
                                           
All we have to do to LiT our strings is use STRING1 and
STRING2 in Split. And once we have LiT both strings we
can wave a magic wand. We can directly test to see if the
2 tables are equal! We simply join by ID's and then test
the tables for equality in where.

select 
 str1 join str2 by ID1=ID2
  where 
   ((ToTable(STRING1.Split({'|'}),'STRING')) {Trim(STRING) STRING})
   =
   ((ToTable(STRING2.Split({'|'}),'STRING')) {Trim(STRING) STRING})
       
ID1 STRING1            ID2 STRING2               
--- ------------------ --- ----------------------
1   AA|BB| CC          1   AA | BB| CC
2   FF|GG|HH MM| J| KK 2   FF |GG |HH MM | J | KK

Comparing tables is just a logical extension of comparing
integers or strings or just about anything else. Tables
are equal if they have the same column(s) (STRING) and the
elements (rows) are equal. Is that a magic wand or what:)
Now it's fundamentally important to understand how this
works. The key of each table is of course STRING. Each
key also implies an order.  So each row, ordered by
the key (STRING), in each table is compared to the
corresponding row in the other table just like you'd
compare integers. If every row is equal in the 2 tables
the comparison is true, otherwise it's false. D4 is
thus able to logically address the rows in a table via
the key. This is fundamentally different than sql which
can't do this. The ability to use keys for logical
addressing has big, big implications. It's one of the
fundamental things that differentiates a relational
language from sql! I'll get into it in future posts.
Now back to our example.

Note that for ID 5, where the two lists have 3 and 2 rows
respectively, the table comparison was no problem.
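
For contrast, sql has no table equality operator. The usual t-sql
workaround is a double EXCEPT; a hedged sketch, assuming the two
split-and-trimmed lists have already been landed in temp tables #a
and #b (both hypothetical here):

-- #a(STRING) and #b(STRING) are hypothetical temp tables holding the trimmed
-- elements of the two lists. The sets are equal only when neither EXCEPT
-- leaves anything over.
if     not exists (select STRING from #a except select STRING from #b)
   and not exists (select STRING from #b except select STRING from #a)
    print 'lists match'
else
    print 'lists differ'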
I figure that if you're still with me you're interested and
pretty sharp. So I'll give you the benefit of the doubt
and assume a pretty important question has occurred to
you. What if the original string has duplicate elements?
For example:

'AAA|TT|CC|AAA'

Let's see what happens when we try to LiT this:

select ToTable ( 'AAA|TT|CC|AAA'.Split({'|'}),'STRING' )
                        
WoW! We get this from the Dataphor compiler:
'Internal Index Error: "Duplicate key violation." '

Yeah we're SOL but does it make sense? Well all tables
in D4 have to have a key. No ifs, ands or buts. The key
is of course STRING and when the string is split into
a table there are duplicates of 'AAA' and there
can't be any duplicate rows. So there goes our table:(
I'm sure you can guess the remedy:) But remember that
D4 is a strongly typed language. A list is not the
same as a table. We've been trying to convert a list to
a table. Two different types; that is the point of
having to convert. And sometimes it just won't work
the way we want it to. Before we remedy this, things
could get more dicey if we have this:

select ToTable ( 'AAA |TT|CC| AAA '.Split({'|'}),'STRING' )

STRING
------
 AAA  
AAA   
CC    
TT    

And then trimmed the strings:

select ToTable ( 'AAA |TT|CC| AAA '.Split({'|'}),'STRING')
                         {Trim(STRING) STRING}
                        
STRING
------
AAA   
CC    
TT    

An element 'AAA' was eliminated (no duplicates allowed) so
we end up with 3 elements instead of 4. And if we compared
this with

'AAA|TT|CC'

select TableDee add {'False Positive' Compare}
   where
     (ToTable ( 'AAA |TT|CC| AAA '.Split({'|'}),'STRING')
                         {Trim(STRING) STRING} )
     =
     (ToTable ( 'AAA|TT|CC'.Split({'|'}),'STRING')
                         {Trim(STRING) STRING} )

Compare       
--------------
False Positive

we'd get a false positive. The tables are equal but
the lists are not! Phew:) (Don't worry about the
'TableDee' thingie, I'll get to that in due time:)

But please come to the idea that table comparisons
are a magic wand and once you start using them you'll
wonder how you ever got along without them. But alas,
you can lose some of that magic when working with
lists so be careful:) (More on table comparisons here).

So finally let's get to splitting the list by the
individual elements, i.e. using an indexer ([]).

On the MS Server newsgroups it's common to see the
phrase 'use a set oriented approach'. This means
don't use procedural/imperative code. So no
cursors, or loops:)
Now I've already gone over splitting strings here.
But I want to show you in a little more depth what
is going on.

Given a string:

'AAA |TT|CC| AAA '

A basic query to get the elements of the list is:

select ToTable( {0, 1, 2, 3, 4 , 5, 6, 7, 8 },'INDEX')
  where 'AAA |TT|CC| AAA '.Split({'|'}).Count()>INDEX
   {INDEX+1 INDEX,'AAA |TT|CC| AAA '.Split({'|'})[INDEX].Trim() STRING}

INDEX STRING
----- ------
1     AAA   
2     TT    
3     CC    
4     AAA   

Note that with the inclusion of INDEX we are guaranteed
not to eliminate duplicate STRINGs. The INDEX column
is always unique and takes part in the key of the
table.

Now the statement:

ToTable( {0, 1, 2, 3, 4 , 5, 6, 7, 8 },'INDEX')

creates a table with column INDEX for the list of integers.

'AAA |TT|CC| AAA '.Split({'|'}).Count()>INDEX

filters on INDEX where the count of the elements in the list
is greater than the INDEX value. There are 4 elements in
the list so INDEX values 0, 1, 2, 3 satisfy this criterion. Remember
lists are 0 based. So we end up with a table with 4 rows, i.e.:

INDEX
-----
0    
1    
2    
3    

Then the statement:

{INDEX+1 INDEX,'AAA |TT|CC| AAA '.Split({'|'})[INDEX].Trim() STRING}

is applied. This statement modifies the above table by redefining
INDEX and adding the column STRING. The STRING column is each
list element, in order, identified by the indexer ([INDEX]).
Each element of the string is also trimmed. We added 1 to INDEX
because.. well, I wanted to, because I like to start from 1 not 0:)
So there you have it. Well...not quite. There's stuff going on
here under the covers that you should know about and understand.

How is it that Dataphor interprets:

{0, 1, 2, 3, 4, 5, 6, 7, 8 }

as a list of integers and not something else?

To answer this let me show you formally how this table of
integers is created in a block of D4 code.

begin
   var Indexes : list(Integer);
   Indexes := { 0, 1, 2, 3, 4 , 5, 6, 7, 8 };
   create session table Indexes
   from
   ToTable(Indexes,'INDEX');
end;

A variable, Indexes, is assigned the type list(Integer), i.e.
a list of arbitrary length whose elements are of type Integer.
Once the variable Indexes is defined as a type of list, the actual
elements of the list (the integers) are assigned to it.
Now the list can be converted into a table. The

create session table Indexes
from

creates what is called a session table. It's like a permanent
table but is only available to the Dataphor session that
creates it. It's like an MS Sql Server ## temp table.

select Indexes

INDEX
-----
0    
1    
2    
.
8

The answer to the question of how Dataphor knows that:

ToTable( {0, 1, 2, 3, 4 , 5, 6, 7, 8 },'INDEX')

is a list of integers is because the system has inferred
it! That's why it isn't necessary to formally declare
any variable with a type of list(Integer). In other
words, a comma separated bunch of integers is logically
a list of integers. The D4 compiler as an inference
engine is a super big deal!:) The concept of logically
inferring metadata (i.e. keys, constraints) does not
exist in an sql database. I'll get into this big
deal in future posts. Ok, back to our example:)

So now you know some more about the integer list in
our query:

select ToTable( {0, 1, 2, 3, 4 , 5, 6, 7, 8 },'INDEX')
  where 'AAA |TT|CC| AAA '.Split({'|'}).Count()>INDEX
   {INDEX+1 INDEX,'AAA |TT|CC| AAA '.Split({'|'})[INDEX].Trim() STRING}

And what about the argument to Split? Well here is a
formal code block for splitting a string (using the
above Indexes table).

begin
  var LList : list(String);
  var ExString : String;
  ExString := 'AAA |TT|CC| AAA ';
  LList := ExString.Split( {'|'} );
create session table ExList
from
 Indexes
   where LList.Count()>INDEX
    {INDEX+1 INDEX, LList[INDEX].Trim() STRING} ;
end;  

select ExList

INDEX STRING
----- ------
1     AAA   
2     TT    
3     CC    
4     AAA   

We don't have to explicitly declare that Split should
result in a list of String elements. All this information
is inferred by the compiler (thank you very much D4:)

I've rambled enough:) Now we're going to finally,
finally split strings and generate a little report.

For good measure we're going to insert another row
in our string tables, str1 and str2, via D4:

insert
 table
 {
     row { 6 ID1, 'AAA |TT|CC| AAA ' STRING1 }
 }
 into str1

insert
 table
 {
     row { 6 ID2, 'TT|CC|AAA' STRING2 }
 }
 into str2

So now we have:

select str1
ID1 STRING1           
--- ------------------
1   AA|BB| CC
2   FF|GG|HH MM| J| KK
3   SS|TT|UU|VV
4   AA|YYYY|PQ|QRDD
5   D10|B20|A30
6   AAA |TT|CC| AAA 

select str2
ID2 STRING2               
--- ----------------------
1   AA | BB| CC
2   FF |GG |HH MM | J | KK
3   SS|TT|UU|XX
4   AA|PPPP|WR|QRDD
5   D10|B20
6   TT|CC|AAA

We want to compare strings by ID and only where
the lists have the same number/count of elements. And
we only want the strings where there are differences
for the same element (INDEX). So each row will include
the ID, the corresponding different STRINGs, the
INDEX for each different element and the individual
different string parts.

select
 (
  (
   (  
    (str1 join str2 by ID1=ID2)
      where STRING1.Split({'|'}).Count()=STRING2.Split({'|'}).Count()
   )
    times // times is like the sql cross join.
     Indexes
      where STRING1.Split({'|'}).Count()>INDEX
  )  
      where // <> is the notation for not equal.
       STRING1.Split({'|'})[INDEX].Trim()<>STRING2.Split({'|'})[INDEX].Trim()
 )    
          {
            ID1 ID,STRING1,STRING2,INDEX+1 INDEX,
            STRING1.Split({'|'})[INDEX].Trim() DIF1,
            STRING2.Split({'|'})[INDEX].Trim() DIF2
          }

ID STRING1            STRING2                INDEX DIF1 DIF2
-- ------------------ ---------------------- ----- ---- ----
3  SS|TT|UU|VV        SS|TT|UU|XX            4     VV   XX
4  AA|YYYY|PQ|QRDD    AA|PPPP|WR|QRDD        2     YYYY PPPP
4  AA|YYYY|PQ|QRDD    AA|PPPP|WR|QRDD        3     PQ   WR

Coming from sql you should think of each set of matching
parentheses (indented) as a derived table. Unlike sql,
a derived table in D4 does not start with a 'select'
statement. In D4 it's just the opposite of sql. A table
is derived and then you specify the columns you want.
Which way makes more sense? In sql you pay for the
meal before you even order it:) Also note that in D4 you
don't have to give the derived table a name (alias). In
D4 things are much cleaner and more logical.

I hope you found this interesting. And there's plenty
more to come:)
