Advice needed on file parsing

Richard llom

Hello fellow scilab-users,
I'm writing a script to read and process files, which are constructed as follows:
<file start>
PCB: 007
ASM: 000
LOT: 00000
FW:  1477971088
CH1:  AMPS   10A
CH2:  VOLT   60V
SMPL: 0064 0125Hz
DESC: 12V CU LOG
UTC TIME SEC  ,CH1 AMPS DC  ,CH2 VOLT DC  
1497812372.910, 8.609146E-03, 1.210613E001
1497812373.895, 1.577809E-01, 1.207540E001
1497812374.578, 1.010268E000, 1.193087E001
... [snip]
<file end>

To process this file further, I need:
1)
the first eight lines stored in pairs, e.g.
info(1,1) should yield "PCB" and info(1,2) should yield "007" (string is ok)

2)
line #9 (header), should be available as header(1)="UTC TIME SEC", etc...

3)
line 10+
these should be scanned in as a matrix.


I already tried csvread and msscanf (?), however with no luck so far...


So if someone could just point me to the appropriate function for each task, I can hopefully take it from there.
Thanks & cheers
richard
Claus Futtrup

Hi Richard

You read the file, first the header and then the matrix, like this:

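rtf = mopen(fname, "r");
headr = mgetl(rtf, 9);    // reads the 9-line header (8 parameter lines + column titles)
coords = mfscanf(-1, rtf, "%lg, %lg, %lg\n");    // fields are comma-separated; %lg keeps double precision for the timestamps
mclose(rtf);    // mclose expects the file descriptor, not the file name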

Best regards,
Claus

paul.carrico

In reply to this post by Richard llom
Hi,

I cannot say whether the following is the best way to proceed, but when the number of columns differs from line to line, I usually turn to functions such as mopen, mgetl, grep and strindex to get the data. It needs a bit of work.

This method works as long as the file is not huge, because mgetl first loads the whole file into memory. For huge files (I mean millions of lines), I have to adopt another strategy (a bash script using awk, grep, sed and similar tools) to produce a text/matrix file in the right format. Nevertheless, I do not get strings that way, so that approach may not work here.
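
A minimal sketch of the mgetl / grep / strindex approach described above. It assumes a shortened sample file written to TMPDIR; the file name and the variable names (txt, info, header, M) are made up for illustration, and the whole file is assumed to fit in memory.

```scilab
// Write a shortened, hypothetical sample file so the sketch runs on its own
fname = fullfile(TMPDIR, "sample_log.txt");
sample = ["PCB: 007"; "DESC: 12V CU LOG"; "UTC TIME SEC  ,CH1 AMPS DC  ,CH2 VOLT DC  "; "1497812372.910, 8.609146E-03, 1.210613E001"; "1497812373.895, 1.577809E-01, 1.207540E001"];
mputl(sample, fname);

txt = mgetl(fname);                     // whole file in memory, one string per line

// Parameter lines contain a colon: split each one at its first colon
iParam = grep(txt, ":");
info = [];
for i = matrix(iParam, 1, -1)
    k = strindex(txt(i), ":");
    key = part(txt(i), 1:k(1)-1);
    val = stripblanks(part(txt(i), k(1)+1:length(txt(i))));
    info = [info; key, val];
end

// Data lines start with a digit: split at the commas and convert to doubles
iData = grep(txt, "/^[0-9]/", "r");
M = [];
for i = matrix(iData, 1, -1)
    M = [M; strtod(strsplit(txt(i), ","))'];
end

// The column titles are assumed to sit on the line just before the first data line
header = stripblanks(strsplit(txt(iData(1)-1), ","))';
```

Growing info and M row by row is fine for a few thousand lines; for larger files, preallocating M would be the first thing to change.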

 
 
Just my feedback,
Paul
 
Alexx

Hello,

You can use the csvTextScan function, which takes the CSV separator as a parameter. You can get what you want by passing ':' and ',' as the separator. As was said, within each call all lines must have the same number of columns and all columns the same number of rows.

It may look something like this:

my_file = mgetl('your_file_path')

// 'string' conversion keeps the text as-is (the default conversion is 'double')
first_eight_lines = csvTextScan(my_file(1:8), ':', [], 'string')

header = csvTextScan(my_file(9), ',', [], 'string')     // ',' is the default separator

data = csvTextScan(my_file(10:$), ',')

Cheers,
Alexis


Samuel GOUGEON

In reply to this post by Richard llom
Hello Richard,

You may use the following:

[M, comments] = csvRead("data.txt", ",", ".", "double",[], "/^[^0-9\-]/"); M
header = tokens(comments($), ",")'
params = csvTextScan(comments(1:$-1), ":", [], "string")

-->[M, comments] = csvRead("data.txt", ",", ".", "double",[], "/^[^0-9\-]/"); M
 M  =
    1.498D+09    0.0086091    12.10613 
    1.498D+09    0.1577809    12.0754  
    1.498D+09    1.010268     11.93087 
 
-->header = tokens(comments($), ",")'
 header  =
!UTC TIME SEC    CH1 AMPS DC    CH2 VOLT DC    !
 
-->params = csvTextScan(comments(1:$-1), ":", [], "string")
 params  =
!PCB    007          !
!ASM    000          !
!LOT    00000        !
!FW      1477971088  !
!CH1     AMPS   10A  !
!CH2     VOLT   60V  !
!SMPL   0064 0125Hz  !
!DESC   12V CU LOG   !

 HTH
Samuel


Osvaldo Carvalho

In reply to this post by Richard llom
Richard,

Perhaps this may help you:

function [info, header, m] = parseFile(fileName)
    file_d = mopen(fileName, "r")
    // first eight lines: "KEY: value" pairs, split at the colon
    for line = 1:8
        lineToParse = mgetl(file_d, 1)
        tks = tokens(lineToParse, ":")
        info(line,1) = stripblanks(tks(1))
        info(line,2) = stripblanks(tks(2))
    end
    header = mgetl(file_d, 1)    // line 9: column titles, kept as a single string
    k = 1
    while ~meof(file_d)
        lineToParse = mgetl(file_d, 1)
        tks = tokens(lineToParse, ",")
        for i = 1:3
            m(k,i) = strtod(tks(i))    // convert each field to a double
        end
        k = k + 1
    end
    mclose(file_d)
endfunction


-->[info,header,m] = parseFile("llom.txt")
 m  =
 
    1.498D+09    0.0086091    12.10613 
    1.498D+09    0.1577809    12.0754  
    1.498D+09    1.010268     11.93087 
 header  =
 
 UTC TIME SEC  ,CH1 AMPS DC  ,CH2 VOLT DC    
 info  =
 
!PCB   007          !
!                   !
!ASM   000          !
!                   !
!LOT   00000        !
!                   !
!FW    1477971088   !
!                   !
!CH1   AMPS   10A   !
!                   !
!CH2   VOLT   60V   !
!                   !
!SMPL  0064 0125Hz  !
!                   !
!DESC  12V CU LOG   !


Richard llom

In reply to this post by Richard llom
Hello All,
thank you all for the quick and numerous replies!

I went with Samuel's version.
However, while trying to understand the syntax and looking at the help:
https://help.scilab.org/docs/6.0.0/en_US/csvRead.html
I stumbled over:
regexpcomments
    a string: a regexp to remove lines which match. (default: [])

which I found to be a misleading / incomplete description.

I suggest instead (or similar):

a string: a regexp to match lines, which are returned in 'comments', or ignored if 'comments' is omitted. (default: [])
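
A small sketch of the behaviour behind this suggestion, assuming a shortened sample file written to TMPDIR (the file name and the variable names rx, M2 are made up for illustration). The csvRead call mirrors the one from Samuel's reply; the single-output form is assumed to drop the matching lines, as the help text says.

```scilab
// Write a shortened, hypothetical sample file so the sketch runs on its own
fname = fullfile(TMPDIR, "regexp_demo.txt");
mputl(["PCB: 007"; "UTC TIME SEC  ,CH1 AMPS DC  "; "1497812372.910, 8.609146E-03"; "1497812373.895, 1.577809E-01"], fname);

rx = "/^[^0-9\-]/";   // matches every line that does not start with a digit or a minus sign

// Two outputs: matching lines are kept out of M and returned in comments
[M, comments] = csvRead(fname, ",", ".", "double", [], rx);

// One output: the matching lines are kept out of the result and not returned
M2 = csvRead(fname, ",", ".", "double", [], rx);
```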


cheers
richard