Home » SQL & PL/SQL » SQL & PL/SQL » Traverse CLOB with Connect By (Oracle, 19.0.0.0.0, Linux)
Traverse CLOB with Connect By [message #689570] Thu, 15 February 2024 07:26 Go to next message
Duane
Messages: 557
Registered: December 2002
Senior Member
I'm trying to see if this is possible by using a "Connect By" without having to loop over the CLOB.

Result:  Insert the values of the CLOB into a table.

Data:


ID,Term,Recruiter||chr(10)||chr(13)
A01,202420,Name 1||chr(10)||chr(13)
A0102,202420,Name 2||chr(10)||chr(13)
A010203,202420,Name 3||chr(10)||chr(13)
A01020304,202420,Name 4||chr(10)||chr(13)
A01,202420,Name 5||chr(10)||chr(13)
A02,202420,Name 6||chr(10)||chr(13)
A03,202420,Name 7||chr(10)||chr(13)
A04,202420,Name 8||chr(10)||chr(13)
A05,202420,Name 9||chr(10)||chr(13)


Table:


CREATE TABLE FILE_CONTENTS
(
  ID         VARCHAR2(10 CHAR),
  TERM       VARCHAR2(6 CHAR),
  RECRUITER  VARCHAR2(50 CHAR)
)


Query:


select txt
  from (select 'ID,Term,Recruiter'||chr(10)||chr(13)||
               'A01,202420,Name 1'||chr(10)||chr(13)||
               'A0102,202420,Name 2'||chr(10)||chr(13)||
               'A010203,202420,Name 3'||chr(10)||chr(13)||
               'A01020304,202420,Name 4'||chr(10)||chr(13)||
               'A01,202420,Name 5'||chr(10)||chr(13)||
               'A02,202420,Name 6'||chr(10)||chr(13)||
               'A03,202420,Name 7'||chr(10)||chr(13)||
               'A04,202420,Name 8'||chr(10)||chr(13)||
               'A05,202420,Name 9'||chr(10)||chr(13) txt
          from dual)
    connect by level <= regexp_count(txt,'([^\'||chr(10)||chr(13)||']+)')


Final Result: CLOB values are inserted into Table.


FILE_CONTENTS

ID           TERM      RECRUITER
A01          202420    Name 1
A0102        202420    Name 2
A010203      202420    Name 3
A01020304    202420    Name 4
A01          202420    Name 5
A02          202420    Name 6
A03          202420    Name 7
A04          202420    Name 8
A05          202420    Name 9

Re: Traverse CLOB with Connect By [message #689571 is a reply to message #689570] Thu, 15 February 2024 08:32 Go to previous messageGo to next message
Michel Cadot
Messages: 68647
Registered: March 2007
Location: Nanterre, France, http://...
Senior Member
Account Moderator
Split CLOB into lines:
SQL> with
  2    data as (
  3      select 'ID,Term,Recruiter'||chr(10)||chr(13)||
  4             'A01,202420,Name 1'||chr(10)||chr(13)||
  5             'A0102,202420,Name 2'||chr(10)||chr(13)||
  6             'A010203,202420,Name 3'||chr(10)||chr(13)||
  7             'A01020304,202420,Name 4'||chr(10)||chr(13)||
  8             'A01,202420,Name 5'||chr(10)||chr(13)||
  9             'A02,202420,Name 6'||chr(10)||chr(13)||
 10             'A03,202420,Name 7'||chr(10)||chr(13)||
 11             'A04,202420,Name 8'||chr(10)||chr(13)||
 12             'A05,202420,Name 9'||chr(10)||chr(13) txt
 13      from dual
 14    )
 15  select regexp_substr(txt, '[^'||chr(10)||chr(13)||']+', 1, level) line
 16  from data
 17  connect by level <= regexp_count(txt,chr(10)||chr(13))
 18  /
LINE
---------------------------------------------------------------------------
ID,Term,Recruiter
A01,202420,Name 1
A0102,202420,Name 2
A010203,202420,Name 3
A01020304,202420,Name 4
A01,202420,Name 5
A02,202420,Name 6
A03,202420,Name 7
A04,202420,Name 8
A05,202420,Name 9

10 rows selected.

Split lines into columns:
SQL> col col1 format a10
SQL> col col2 format a10
SQL> col col3 format a10
SQL> with
  2    data as (
  3      select 'ID,Term,Recruiter'||chr(10)||chr(13)||
  4             'A01,202420,Name 1'||chr(10)||chr(13)||
  5             'A0102,202420,Name 2'||chr(10)||chr(13)||
  6             'A010203,202420,Name 3'||chr(10)||chr(13)||
  7             'A01020304,202420,Name 4'||chr(10)||chr(13)||
  8             'A01,202420,Name 5'||chr(10)||chr(13)||
  9             'A02,202420,Name 6'||chr(10)||chr(13)||
 10             'A03,202420,Name 7'||chr(10)||chr(13)||
 11             'A04,202420,Name 8'||chr(10)||chr(13)||
 12             'A05,202420,Name 9'||chr(10)||chr(13) txt
 13      from dual
 14    ),
 15    lines as (
 16      select regexp_substr(txt, '[^'||chr(10)||chr(13)||']+', 1, level) line
 17      from data
 18      connect by level <= regexp_count(txt,chr(10)||chr(13))
 19    )
 20  select regexp_substr(line, '[^,]+', 1, 1) col1,
 21         regexp_substr(line, '[^,]+', 1, 2) col2,
 22         regexp_substr(line, '[^,]+', 1, 3) col3
 23  from lines
 24  /
COL1       COL2       COL3
---------- ---------- ----------
ID         Term       Recruiter
A01        202420     Name 1
A0102      202420     Name 2
A010203    202420     Name 3
A01020304  202420     Name 4
A01        202420     Name 5
A02        202420     Name 6
A03        202420     Name 7
A04        202420     Name 8
A05        202420     Name 9

10 rows selected.
Re: Traverse CLOB with Connect By [message #689572 is a reply to message #689571] Thu, 15 February 2024 09:06 Go to previous messageGo to next message
Duane
Messages: 557
Registered: December 2002
Senior Member
That's what I needed.  Thank you so much.
Re: Traverse CLOB with Connect By [message #689573 is a reply to message #689572] Thu, 15 February 2024 09:58 Go to previous messageGo to next message
Michel Cadot
Messages: 68647
Registered: March 2007
Location: Nanterre, France, http://...
Senior Member
Account Moderator

Note that this assumes there are no chr(10) or chr(13) in your column data, that is there are only at the end of lines.

Re: Traverse CLOB with Connect By [message #689574 is a reply to message #689573] Thu, 15 February 2024 10:40 Go to previous messageGo to next message
Duane
Messages: 557
Registered: December 2002
Senior Member
Already have a working example.  Just working on not selecting the HEADER record so it doesn't get inserted into the Table.

[Updated on: Thu, 15 February 2024 10:41]

Report message to a moderator

Re: Traverse CLOB with Connect By [message #689575 is a reply to message #689574] Thu, 15 February 2024 11:15 Go to previous messageGo to next message
Michel Cadot
Messages: 68647
Registered: March 2007
Location: Nanterre, France, http://...
Senior Member
Account Moderator
SQL> col col1 format a10
SQL> col col2 format a10
SQL> col col3 format a10
SQL> with
  2    data as (
  3      select 'ID,Term,Recruiter'||chr(10)||chr(13)||
  4             'A01,202420,Name 1'||chr(10)||chr(13)||
  5             'A0102,202420,Name 2'||chr(10)||chr(13)||
  6             'A010203,202420,Name 3'||chr(10)||chr(13)||
  7             'A01020304,202420,Name 4'||chr(10)||chr(13)||
  8             'A01,202420,Name 5'||chr(10)||chr(13)||
  9             'A02,202420,Name 6'||chr(10)||chr(13)||
 10             'A03,202420,Name 7'||chr(10)||chr(13)||
 11             'A04,202420,Name 8'||chr(10)||chr(13)||
 12             'A05,202420,Name 9'||chr(10)||chr(13) txt
 13      from dual
 14    ),
 15    lines as (
 16      select regexp_substr(txt, '[^'||chr(10)||chr(13)||']+', 1, level) line, level rn
 17      from data
 18      connect by level <= regexp_count(txt,chr(10)||chr(13))
 19    )
 20  select regexp_substr(line, '[^,]+', 1, 1) col1,
 21         regexp_substr(line, '[^,]+', 1, 2) col2,
 22         regexp_substr(line, '[^,]+', 1, 3) col3
 23  from lines
 24  where rn > 1
 25  /
COL1       COL2       COL3
---------- ---------- ----------
A01        202420     Name 1
A0102      202420     Name 2
A010203    202420     Name 3
A01020304  202420     Name 4
A01        202420     Name 5
A02        202420     Name 6
A03        202420     Name 7
A04        202420     Name 8
A05        202420     Name 9

9 rows selected.
Re: Traverse CLOB with Connect By [message #689576 is a reply to message #689575] Thu, 15 February 2024 13:46 Go to previous messageGo to next message
mathguy
Messages: 107
Registered: January 2023
Senior Member
Since Oracle added support for JSON in Oracle 12.1 (and improved it with each new release), a more efficient way of splitting delimited strings emerged - transform the input string into a JSON document and use JSON_TABLE or similar for the splitting. Extensive testing performed by Stew Ashton (and confirmed by numerous other tests) shows that this approach is faster than other, older methods.

I will describe the solution below in some detail. You didn't tell us where the CLOB is coming from; in my illustration I create it on the spot in a WITH clause. (This would not work exactly as written for a true CLOB, meaning one of some size, but that isn't relevant for the current question.)

I only show a SELECT statement that splits everything as needed - you can then use this in an INSERT statement.

with
  input_data (txt) as (
    select to_clob(
'ID,Term,Recruiter'||chr(10)||chr(13)||
'A01,202420,Name 1'||chr(10)||chr(13)||
'A0102,202420,Name 2'||chr(10)||chr(13)||
'A010203,202420,Name 3'||chr(10)||chr(13)||
'A01020304,202420,Name 4'||chr(10)||chr(13)||
'A01,202420,Name 5'||chr(10)||chr(13)||
'A02,202420,Name 6'||chr(10)||chr(13)||
'A03,202420,Name 7'||chr(10)||chr(13)||
'A04,202420,Name 8'||chr(10)||chr(13)||
'A05,202420,Name 9'||chr(10)||chr(13))
    from dual
  )
, prep (json_arr) as (
    select '[' || replace(replace(json_array(substr(txt, 1, length(txt) - 2)), ',', '","'), '\n\r', '"],["') || ']'
    from   input_data
  )
select id, term, recruiter
from   prep nested json_arr, '$[*]' 
                   columns (ord       for ordinality,
                            id        varchar2(10 char) path '$[0]',
                            term      varchar2( 6 char) path '$[1]',
                            recruiter varchar2(50 char) path '$[2]')
where  ord > 1
;
The PREP subquery (inline view) converts the input (comma and chr10-chr13 delimited string) into a JSON array of JSON arrays. The inner-most transformation, the SUBSTR, is to remove the line-end marker from the last line (since JSON arrays use delimiters, not terminators). Then the resulting string is passed to JSON_ARRAY. "Array" in this context is not truly an array: we pass a single scalar value (a long string), and we get a JSON array with a single element. This does two things: it appends [" and prepends "] to the string, and it escapes JSON special characters. (Namely: [ and ", but also control characters like chr10 and chr13 - which is actually a slight complication.) If we could assume that the "real data" in the CLOB does not include " or [ and such, the code could be simplified a bit; but I think it is a good practice to write our code as if we did need to be careful.

Then we replace the commas in the original text with ",", and the chr10-chr13 pairs with "],[". This will cause the tokens in the original text to appear enclosed in double-quotes (remember that any double-quotes in the input tokens are now escaped!), and the "lines" of the original text to become sub-arrays in the global JSON array created earlier. These are just mindless text manipulations, they don't use JSON tools; but we were careful enough earlier that this step will work as needed. Two things to note: the REPLACE function now must hunt for \n\r, since the chr(10) and chr(13) characters were escaped when we applied JSON_ARR; and the order of the two calls to REPLACE is essential: we must replace the original commas first (appending and prepending " to each), and ONLY THEN apply the REPLACE to the line separators, because these will add new commas, and they must be handled differently from the original commas in the text.

Finally, we must append [ and prepend ] to the result, since the ones added by JSON_ARR have become the opening [ for the first nested array and the closing ] for the last nested array, respectively.

At this point, it would help to remove the main SELECT from the end of the code, and replace it with SELECT JSON_ARR FROM PREP to see exactly what we did so far.

The rest is a trivial application of JSON_TABLE to extract data from an array of arrays. I used the newer (simpler) syntax introduced in Oracle 19, but that's just for fun.

The condition ORD > 1 in the WHERE clause is the one that causes the headers to be excluded. I could have excluded the headers much earlier in the process, but I thought this gives more flexibility (what if tomorrow we must use this on a CLOB where the first three lines are header lines, etc.)
Re: Traverse CLOB with Connect By [message #689577 is a reply to message #689576] Thu, 15 February 2024 16:10 Go to previous messageGo to next message
Duane
Messages: 557
Registered: December 2002
Senior Member
mathguy wrote on Thu, 15 February 2024 19:46
Since Oracle added support for JSON in Oracle 12.1 (and improved it with each new release), a more efficient way of splitting delimited strings emerged - transform the input string into a JSON document and use JSON_TABLE or similar for the splitting. Extensive testing performed by Stew Ashton (and confirmed by numerous other tests) shows that this approach is faster than other, older methods.
I used your example on a really large CLOB and I'm getting this error "[Error] Execution (1: 1): ORA-40478: output value too large (maximum: 4000)".  The CLOB I used has a data length of 66169.

Is there a work around for that?

Re: Traverse CLOB with Connect By [message #689578 is a reply to message #689577] Thu, 15 February 2024 19:10 Go to previous messageGo to next message
mathguy
Messages: 107
Registered: January 2023
Senior Member
I don't understand. I thought you already talked about "working example"; do you mean that this is in fact your first attempt at accessing the 66k+ character CLOB? Where do you get the error? I assume from the very beginning; do you get it if right after the first subquery in the WITH clause, you SELECT TXT FROM INPUT_DATA?

There are workarounds, but let's make sure I understand your question first. This is a distinct question, it has nothing to do with the rest of the problem. Or does it?
Re: Traverse CLOB with Connect By [message #689579 is a reply to message #689578] Thu, 15 February 2024 19:27 Go to previous messageGo to next message
Duane
Messages: 557
Registered: December 2002
Senior Member
mathguy wrote on Fri, 16 February 2024 01:10
I don't understand. I thought you already talked about "working example"; do you mean that this is in fact your first attempt at accessing the 66k+ character CLOB? Where do you get the error? I assume from the very beginning; do you get it if right after the first subquery in the WITH clause, you SELECT TXT FROM INPUT_DATA?

There are workarounds, but let's make sure I understand your question first. This is a distinct question, it has nothing to do with the rest of the problem. Or does it?
That is correct.  I have a working example with data much smaller than the 66K+ character CLOB.  When I try the example with the 66K+ character CLOB then I get the error I pointed out.  I would say it's right at the beginning.

Distinct question.  Works fine with smaller data but not the 66K+ CLOB.

[Updated on: Thu, 15 February 2024 19:28]

Report message to a moderator

Re: Traverse CLOB with Connect By [message #689580 is a reply to message #689579] Thu, 15 February 2024 20:49 Go to previous messageGo to next message
Duane
Messages: 557
Registered: December 2002
Senior Member
Ok, I got this to work by adding "returning clob" within the "json_array()" statement.

This does return data so I must have something wrong in the json statement.  Actually, removing ORD > 1 returns the HEADER but no actual data.  Not sure what else is wrong.


with
  input_data (txt) as 
    (select file_contents txt
       from files
         where id = '1999'), 
  prep (json_arr) as 
    (select '[' || replace(replace(json_array(substr(txt, 1, length(txt) - 2) returning clob), ',', '","'), '\n\r', '"],["') || ']'
       from input_data)

select *
  from prep

Re: Traverse CLOB with Connect By [message #689581 is a reply to message #689580] Thu, 15 February 2024 21:22 Go to previous messageGo to next message
Duane
Messages: 557
Registered: December 2002
Senior Member
Must be the data.  I wonder if it's the difference between the test data having hard coded values of "||chr(10)||chr(13)||" and the production data having CR/LR.

Test Data:

Select * from prep


[["ID","Term","Recruiter"],["A01","202420","Name 1"],["A0102","202420","Name 2"],["A010203","202420","Name 3"],

66K+ Data:

Select * from prep


[["Campus","Complete","ip","program","term","id","\r\nDL"," "," ","57","209990","88","\r\nDL"," ","I","84","209990","17","\r\nDL"," ","I","50","209990","20","\r\nDL"," ","I","79","209990","26","\r\nDL"," ","I","92","209990","37","\r\nDL"," ","I","57","209990","47","\r\nDL","


I wonder if it's because the header has that extra ",".  Data also has an extra "," before the CR/LR.

HEADER:
Campus,Complete,ip,program,term,id,

[Updated on: Thu, 15 February 2024 21:24]

Report message to a moderator

Re: Traverse CLOB with Connect By [message #689582 is a reply to message #689581] Thu, 15 February 2024 23:46 Go to previous messageGo to next message
mathguy
Messages: 107
Registered: January 2023
Senior Member
Good catch on the RETURNING CLOB option to JSON_ARRAY. I forgot that the default is VARCHAR2.

Frankly chr(10) || chr(13) as the line terminator looked odd, but suum cuique as the saying goes. CRLF - that is chr(13) || chr(10) - is the line terminator in Windows; Unix and its descendants use chr(10), which was adopted by Oracle also, and I heard that "old" Mac OS used chr(13) (just for spite I suppose), until they joined the Unix gang; but I haven't seen LFCR - or chr(10) || chr(13) - as line terminator. (As an aside, note that chr(10) is known as LF, for Line Feed; there is no LR.) In any case, I wrote the code for the sample data you posted.

Of course, if the real-life data uses chr(13) || chr(10), then using them swapped in the code won't work. What happened when you changed that in the code? (You did try, right?)

The extra comma in each line (meaning that tokens are also "terminated" by comma, rather than just "delimited" or "separated") should not cause problems. That's because in the approach I showed, that extra comma causes each nested array to have one additional token in the last position, which is an empty string in all cases. In the JSON_TABLE application, I only select the first three elements from each such nested array; if there is (or there isn't) a fourth element, empty or otherwise, does not matter, since it's not referenced either way.

So, if these are the only deviations between the real-life data and the sample you gave us, then simply fixing the part of the code that refers to the line terminator should suffice. But, looking at what you posted - what is that DL after each \r\n at the end of each line? Is there something else in the raw file, in addition to chr(13) || chr(10) at the end of each line?

Where is the raw text? Is it a text file on the server? Under what operating system? Every operating system has tools to look at a raw text file to see what special characters it contains (in Oracle you would use the DUMP function); what do you see in the file - or in the table, using DUMP, if in the Oracle db already?

[Updated on: Thu, 15 February 2024 23:53]

Report message to a moderator

Re: Traverse CLOB with Connect By [message #689583 is a reply to message #689582] Fri, 16 February 2024 00:35 Go to previous messageGo to next message
Michel Cadot
Messages: 68647
Registered: March 2007
Location: Nanterre, France, http://...
Senior Member
Account Moderator

Quote:
Where is the raw text? Is it a text file on the server?

@mathguy, have a look at his previous topic: Read BLOB from Table into CLOB for further processing.

Re: Traverse CLOB with Connect By [message #689587 is a reply to message #689582] Fri, 16 February 2024 07:44 Go to previous messageGo to next message
Duane
Messages: 557
Registered: December 2002
Senior Member
mathguy wrote on Fri, 16 February 2024 05:46
Good catch on the RETURNING CLOB option to JSON_ARRAY. I forgot that the default is VARCHAR2.

Frankly chr(10) || chr(13) as the line terminator looked odd, but suum cuique as the saying goes. CRLF - that is chr(13) || chr(10) - is the line terminator in Windows; Unix and its descendants use chr(10), which was adopted by Oracle also, and I heard that "old" Mac OS used chr(13) (just for spite I suppose), until they joined the Unix gang; but I haven't seen LFCR - or chr(10) || chr(13) - as line terminator. (As an aside, note that chr(10) is known as LF, for Line Feed; there is no LR.) In any case, I wrote the code for the sample data you posted.

Of course, if the real-life data uses chr(13) || chr(10), then using them swapped in the code won't work. What happened when you changed that in the code? (You did try, right?)

The extra comma in each line (meaning that tokens are also "terminated" by comma, rather than just "delimited" or "separated") should not cause problems. That's because in the approach I showed, that extra comma causes each nested array to have one additional token in the last position, which is an empty string in all cases. In the JSON_TABLE application, I only select the first three elements from each such nested array; if there is (or there isn't) a fourth element, empty or otherwise, does not matter, since it's not referenced either way.

So, if these are the only deviations between the real-life data and the sample you gave us, then simply fixing the part of the code that refers to the line terminator should suffice. But, looking at what you posted - what is that DL after each \r\n at the end of each line? Is there something else in the raw file, in addition to chr(13) || chr(10) at the end of each line?

Where is the raw text? Is it a text file on the server? Under what operating system? Every operating system has tools to look at a raw text file to see what special characters it contains (in Oracle you would use the DUMP function); what do you see in the file - or in the table, using DUMP, if in the Oracle db already?
"what is that DL after each \r\n at the end of each line?"  That is the first column of data.  See my data example.


Ok, I flipped the CR/LF in the query and ran it with a small number of records.  That worked with the 30 lines of data including the HEADER.

Data:


Campus,Complete,ip,program,term,id,
DL, , ,57,202330,88,
DL, ,I,84,202330,17,
DL, ,I,50,202330,20,
DL, ,I,79,202230,26,
DL, ,I,92,202330,37,
DL, ,I,57,202230,47,
DL, , ,57,202320,19,
DL, ,I,80,202330,26,
DL, ,I,92,202230,27,
DL, ,I,74,202330,33,
DL, ,I,80,202320,42,
DL, ,I,50,202330,46,
DL, ,I,92,202230,48,
DL, , ,57,202330,49,
DL, , ,57,202220,49,
DL, , ,57,202320,56,
DL, , ,57,202330,57,
DL, ,I,57,201810,58,
DL, ,I,84,202130,61,
DL, ,I,89,202330,61,
DL, ,I,84,202330,62,
DL, ,I,51,202330,65,
DL, ,I,50,202230,66,
DL, ,I,92,202320,70,
DL, , ,57,202330,70,
DL, ,I,92,202230,70,
DL, ,I,50,202330,74,
DL, ,I,50,202330,74,
DL, ,I,57,202330,74,

Query


with
  input_data (txt) as 
    (select file_contents txt
       from uploaded_files
         where id = '2'), 
  prep (json_arr) as 
    (select '[' || replace(replace(json_array(substr(txt, 1, length(txt) - 2) returning clob), ',', '","'), '\r\n', '"],["') || ']'
       from input_data)

select campus, complete, ip, program, term, id
  from prep nested json_arr, 
       '$[*]' columns (ord       for ordinality,
                       campus    varchar2(2 char)  path '$[0]',
                       complete  varchar2(1 char)  path '$[1]',
                       ip        varchar2(1 char)  path '$[2]',
                       program   varchar2(1 char)  path '$[3]',
                       term      varchar2(6 char)  path '$[4]',
                       id        varchar2(10 char) path '$[5]')

Taking the data above and copying it to be 2060 lines of data returned "[Error] Execution (1: 1): ORA-01406: fetched column value was truncated".


Code:


with
  input_data (txt) as 
    (select file_contents txt
       from uploaded_files
         where id = '3'), 
  prep (json_arr) as 
    (select '[' || replace(replace(json_array(substr(txt, 1, length(txt) - 2) returning clob), ',', '","'), '\r\n', '"],["') || ']'
       from input_data)

select campus, complete, ip, program, term, id
  from prep nested json_arr, 
       '$[*]' columns (ord       for ordinality,
                       campus    varchar2(2 char)  path '$[0]',
                       complete  varchar2(1 char)  path '$[1]',
                       ip        varchar2(1 char)  path '$[2]',
                       program   varchar2(1 char)  path '$[3]',
                       term      varchar2(6 char)  path '$[4]',
                       id        varchar2(10 char) path '$[5]')

Table that is used to store the CLOB results while testing.


CREATE TABLE UPLOADED_FILES
(
  ID             NUMBER,
  FILE_CONTENTS  CLOB
)

[Updated on: Fri, 16 February 2024 07:46]

Report message to a moderator

Re: Traverse CLOB with Connect By [message #689588 is a reply to message #689587] Fri, 16 February 2024 08:02 Go to previous messageGo to next message
Duane
Messages: 557
Registered: December 2002
Senior Member
Never mind.  It works with 30 lines or 2060 lines of data.  My problem was the HEADER.  Putting "ord > 1" solved my problem.

My only question now is about double quotations '"' around the data.

Data:


campus,complete,ip,program,term,id,
"DL", , ,57,"202330",88,
DL, ,I,84,202330,17,
DL, ,I,50,202330,20,

Result:  (term data is not returned)


"DL		 	57		88
DL	 	I	84		17
DL	 	I	50		20

Query:


with
  input_data (txt) as 
    (select file_contents txt
       from uploaded_files
         where id = '2'), 
  prep (json_arr) as 
    (select '[' || replace(replace(json_array(substr(txt, 1, length(txt) - 2) returning clob), ',', '","'), '\r\n', '"],["') || ']'
       from input_data)

select campus, complete, ip, program, term, id
  from prep nested json_arr, 
       '$[*]' columns (ord       for ordinality,
                       campus    varchar2(2 char)  path '$[0]',
                       complete  varchar2(1 char)  path '$[1]',
                       ip        varchar2(1 char)  path '$[2]',
                       program   varchar2(1 char)  path '$[3]',
                       term      varchar2(6 char)  path '$[4]',
                       id        varchar2(10 char) path '$[5]')
    where ord > 1;

[Updated on: Fri, 16 February 2024 08:03]

Report message to a moderator

Re: Traverse CLOB with Connect By [message #689589 is a reply to message #689588] Fri, 16 February 2024 08:15 Go to previous messageGo to next message
Duane
Messages: 557
Registered: December 2002
Senior Member
Corrected the "program" length.

Query:


with
  input_data (txt) as 
    (select file_contents txt
       from uploaded_files
         where id = '2'), 
  prep (json_arr) as 
    (select '[' || replace(replace(json_array(substr(txt, 1, length(txt) - 2) returning clob), ',', '","'), '\r\n', '"],["') || ']'
       from input_data)

select campus, complete, ip, program, term, id
  from prep nested json_arr, 
       '$[*]' columns (ord       for ordinality,
                       campus    varchar2(2 char)  path '$[0]',
                       complete  varchar2(1 char)  path '$[1]',
                       ip        varchar2(1 char)  path '$[2]',
                       program   varchar2(10 char) path '$[3]',
                       term      varchar2(6 char)  path '$[4]',
                       id        varchar2(10 char) path '$[5]')
    where ord > 1;

Result: (left " is left within the data but the right " is removed)


"DL		 	57	"202330
DL	 	I	84	202330	17
DL	 	I	50	202330	20

[Updated on: Fri, 16 February 2024 08:15]

Report message to a moderator

Re: Traverse CLOB with Connect By [message #689590 is a reply to message #689589] Fri, 16 February 2024 08:20 Go to previous messageGo to next message
mathguy
Messages: 107
Registered: January 2023
Senior Member
Are there double quotes in the raw text?
Re: Traverse CLOB with Connect By [message #689591 is a reply to message #689590] Fri, 16 February 2024 09:07 Go to previous messageGo to next message
Duane
Messages: 557
Registered: December 2002
Senior Member
Yes.  I changed the first line to test if the double quotations would be removed.  I could be wrong but I thought you had said you account for the data having double quotations.  Half way works as you can see.

Data:


campus,complete,ip,program,term,id,
"DL", , ,57,"202330",88,
DL, ,I,84,202330,17,
DL, ,I,50,202330,20,


Result:


"DL		 	57	"202330
DL	 	I	84	202330	17
DL	 	I	50	202330	20

Re: Traverse CLOB with Connect By [message #689592 is a reply to message #689582] Fri, 16 February 2024 09:29 Go to previous messageGo to next message
mathguy
Messages: 107
Registered: January 2023
Senior Member
I didn't say double quotes are "removed" - just the opposite. I said they are "escaped".

Your test data (at least) didn't show double-quotes as special characters - such as "enclosing characters" for tokens (as you may often see in comma-separated files and such). This means that it's possible double-quotes are literal characters in your data. For example, a NAME value may be Joe "Hit-Me-Twice" Smith. You want the same double-quotes to survive when you insert the data in your final table, you don't want them removed.

JSON uses double-quotes to enclose strings, so literal double-quotes in the actual data must be escaped (preceded by backslash) in JSON strings. That's what JSON_ARRAY does for me in the query; I want double-quotes to be escaped, so that they survive in the final string values, not so that they are removed! Your tests show that that works exactly as intended.
Re: Traverse CLOB with Connect By [message #689593 is a reply to message #689592] Fri, 16 February 2024 10:15 Go to previous messageGo to next message
Duane
Messages: 557
Registered: December 2002
Senior Member
You did say that.  My fault.

If what I'm reading is correct, then the result set is not correct.  The result shows only one double-quote (left side) remains in the result set.  Where's the closing double-quote for the right side of the data?

Input Data:  (Double-quotes around "DL" and "202330")


campus,complete,ip,program,term,id,
"DL", , ,57,"202330",88,

Result: (DL has a double-quote on the left side but nothing on the right side.  Same with 202330.  The ID value is also missing for that line.)


"DL		 	57	"202330

All in all, thank you for your help and coming up with this query.  It's what I needed.

[Updated on: Fri, 16 February 2024 10:15]

Report message to a moderator

Re: Traverse CLOB with Connect By [message #689594 is a reply to message #689593] Fri, 16 February 2024 10:23 Go to previous messageGo to next message
mathguy
Messages: 107
Registered: January 2023
Senior Member
Closing double quote missing because you shortened string lengths in the COLUMNS clause.
Re: Traverse CLOB with Connect By [message #689595 is a reply to message #689594] Fri, 16 February 2024 10:39 Go to previous messageGo to next message
Duane
Messages: 557
Registered: December 2002
Senior Member
Ah, gotcha.  Will need to account for that in the data.  Never thought of that.

I think you had mentioned that a lot of CSV files have double-quoted values.  This particular file doesn't.

We're all good then.

Thanks again.
Re: Traverse CLOB with Connect By [message #689596 is a reply to message #689576] Fri, 16 February 2024 20:51 Go to previous messageGo to next message
Duane
Messages: 557
Registered: December 2002
Senior Member
mathguy wrote on Thu, 15 February 2024 19:46


I only show a SELECT statement that splits everything as needed - you can then use this in an INSERT statement.

Are you accounting for the CR/LF with "length(txt) - 2" within the following statement?  Just trying to understand why it's "length(txt) - 2" and not just "length(txt)".

substr(txt, 1, length(txt) - 2)


I just want to make sure I totally understand all the parts of the query.

Re: Traverse CLOB with Connect By [message #689597 is a reply to message #689596] Fri, 16 February 2024 21:54 Go to previous messageGo to next message
mathguy
Messages: 107
Registered: January 2023
Senior Member
I do that to remove the last CR LF from the text. (If I just left LENGTH(TXT) as the last argument, then I wouldn't need SUBSTR at all - the result would be just TXT!)

There is a meaningful distinction between "separated" and "terminated" lists. For example, each "row" in your data is terminated by CR LF - which means that the last "row" includes the terminal CR LF. Similarly, each "token" in each of your "rows" is terminated by a comma; that means that the last character before CR LF in each "row" is always a comma, indicating the end of the last token.

JSON arrays use comma as a delimiter, not as a terminator. A JSON array consisting of three numbers looks like [3,2,-9] - not [3,2,-9,]. The latter array consists of FOUR tokens, the last one of which is empty, which doesn't really make sense: there is no "missing value" concept in JSON. Strict JSON syntax will flag that as invalid JSON, while lax syntax will allow it by simply ignoring the last comma. JSON has the null value (but it must be spelled out, it can't just be a non-existing token) and it supports empty strings (enclosed in double-quotes!), but not completely empty space. Note though that my "manual" manipulation of the string adds double-quotes around every value; that array would look like ["3","2","-9",""] after my manipulations, and the last token is now an empty string, which is valid.

This is about the inner JSON arrays. With regard to the "global" (outer) JSON array, which represents all the data, I replace each CR LF with (essentially) a comma; more characters to enclose tokens and end and begin inner arrays, but as a separator, CR LF becomes comma in the outer JSON array. Here if I allow the last CR LF to stand, I end up with a mess; this means that after the last nested array (representing a row of real data), if I allow the last CR LF to stand (and to be treated as a separator in the outer array!), there will be an extra nested array at the end, consisting in a single empty string (enclosed in double-quotes). I don't want to have to deal with that - it may throw an error, or end up inserting a row of NULL in the table; bad either way. I want CR LF to be a separator in your CLOB, not a terminator - so I simply chop off the last CR LF from the outset, by leaving out the last two characters of your CLOB.

In more delicate situations you would also have to handle the issue of comma being a token terminator, not separator, in each row; but in this case, you don't, for the reason I explained earlier.
Re: Traverse CLOB with Connect By [message #689598 is a reply to message #689597] Sat, 17 February 2024 11:03 Go to previous message
Duane
Messages: 557
Registered: December 2002
Senior Member
I meant the last CR/LF but didn't specify.  I pulled apart the query returning sections of the data to see what was happening.  I counted out the characters so it appeared that's what you were doing.

Thanks for the explanation.
Previous Topic: Read BLOB from Table into CLOB for further processing
Next Topic: Help needed in Tricky Group by
Goto Forum:
  


Current Time: Sat Apr 27 08:08:22 CDT 2024