[Scilab-users] parsing TSV (or CSV) file with scilab is a nightmare

Antoine Monmayrant

[Scilab-users] parsing TSV (or CSV) file with scilab is a nightmare

Hi all,


This is both a rant and a desperate cry for help.
I'm trying to parse some TSV data (tab separated data file) with scilab and I cannot find a way to navigate around the minefield of bugs present in meof/mgetl/mgetstr/csvRead.

A bit of context: I need to load into scilab data generated by a closed source software.
The data is in the form of many TSV files (that I cannot share in full, just some redacted bits) with a header and a footer.
I don't want to hand modify these files or edit them in any way (I need to keep this as portable as possible, so no sed/awk/grep...)

OPTION 1: csvRead

That's the most intuitive solution. However, because of http://bugzilla.scilab.org/show_bug.cgi?id=16391 and the presence of more than one empty line in my header/footer, it crashes Scilab.

OPTION 2: hand parsing line by line using mgetl/meof

I tried:

filename="tsv.txt";
[fd, err] = mopen(filename, 'rt');
while ~meof(fd) do
    txtline=mgetl(fd,1);
end
mclose(fd)

Sadly, and contrary to what's written in "help mgetl", meof keeps on returning 0 well past the end of the file, and the while loop never ends!

OPTION 3: hand parsing chunk by chunk using mgetstr/meof

"help meof" does not confirm that meof should work with mgetl, but mgetstr is specifically listed.
I thus tried:

filename="tsv.txt";
[fd, err] = mopen(filename, 'rt');
while ~meof(fd) do
    txtchunk=mgetstr(80,fd);
end
mclose(fd)

But thanks to http://bugzilla.scilab.org/show_bug.cgi?id=16419 this also crashes Scilab.


OPTION 4: Can anyone here help me with this?

I am really running out of ideas.
Did I miss some -hmm- obvious combination of available file parsing scilab functions to achieve my goal?
I have the feeling that it would have been faster for me to just learn a totally new language that does not suck at parsing files than trying to get it to work with scilab....


Antoine

(depressed)



http://bugzilla.scilab.org/show_bug.cgi?id=16419


_______________________________________________
users mailing list
[hidden email]
http://lists.scilab.org/mailman/listinfo/users
David Chèze

Re: parsing TSV (or CSV) file with scilab is a nightmare

Hi Antoine,

 

Did you also look at fscanfMat? It's handy when the separators are spaces or tabs.

 

regards,

 

David


(In reply to Antoine Monmayrant's message to the Scilab users mailing list of Monday 27 April 2020, 17:40.)

aweeks

Re: [EXT] parsing TSV (or CSV) file with scilab is a nightmare

In reply to this post by Antoine Monmayrant

Hi Antoine,

 

I often have to read csv files with odd lines that trip functions like csvRead so I often use the method below.  It may solve your problem.

 

    // readfile: path of the file to read
    dataread = mgetl(readfile);                          // Read everything
    a = [];
    b = [];
    // …
    for i = 1:size(dataread, 'r') do
        line = dataread(i);
        if length(line) ~= 0 then                        // Ignore blank lines
            line = tokens(line, [' ', ',', ascii(9)]);   // Accept spaces, commas or tabs
            if and(isnum(line)) then                     // If the line is all-numeric
                line = strtod(line);
                a = [a; line(1)];
                b = [b; line(2)];
                // …
            end
        end
    end

 

 

Adrian Weeks

JLan

Re: parsing TSV (or CSV) file with scilab is a nightmare

In reply to this post by Antoine Monmayrant

Antoine

To find out how long the file is (although not strictly necessary) I normally use:

fid = mopen(datafile,'rb');
mseek(0,fid,'end');
lef=mtell(fid)
mseek(0,fid);

Then you can read in the whole file as raw bytes in one call (or split it up if it is big):

data=mgeti(lef,'c',fid);

The rest is just a matter of looking for the relevant characters (separators, line breaks) and sorting the data accordingly.
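
As a rough illustration of that last step, here is a minimal sketch that turns the raw bytes back into lines and fields. It is not taken from Jan's message: the sample file, the variable names (bytes, txtlines, fields) and the use of 'uc' (unsigned bytes) instead of 'c' are assumptions made for the example, and it assumes plain ASCII content.

```scilab
// Hypothetical sample file, written to TMPDIR so the sketch is self-contained
datafile = fullfile(TMPDIR, "bytes_demo.tsv");
mputl(["x" + ascii(9) + "y"; "1" + ascii(9) + "2"], datafile);

fid = mopen(datafile, "rb");
mseek(0, fid, "end");
lef = mtell(fid);                       // file length in bytes
mseek(0, fid);                          // back to the start of the file
bytes = mgeti(lef, "uc", fid);          // read everything as unsigned bytes
mclose(fid);

txt = ascii(double(bytes));             // byte codes -> one long string
txt = strsubst(txt, ascii(13), "");     // drop carriage returns (CRLF files)
txtlines = strsplit(txt, ascii(10));    // split on line feeds
txtlines = txtlines(length(txtlines) > 0);   // drop empty lines

fields = strsplit(txtlines(2), ascii(9));    // tab-separated fields of line 2
```

Reading as unsigned bytes keeps every code in the 0..255 range, which makes the later conversion with ascii() simpler than with signed values.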

Jan


Antoine Monmayrant

Re: parsing TSV (or CSV) file with scilab is a nightmare

In reply to this post by David Chèze

Hello David,


Thanks.
No, I did not look at fscanfMat, as I also wanted to import the header and footer.

Samuel and Jan also suggested simply using "mgetl(fd)" to grab the whole file at once.
Their solution seems to work (or at least does not crash scilab on the first file I tested!).


Thank you all for your kind help,


Antoine



Antoine Monmayrant

Re: [EXT] parsing TSV (or CSV) file with scilab is a nightmare

In reply to this post by aweeks

Hello Adrian,


In essence, your extremely useful solution is similar to what Samuel and Jan proposed: grab the whole file at once.
I must admit I did not even consider it, given the length of the files involved and how easily I managed to crash Scilab on small files.


Thanks,


Antoine

Rafael Guerra

Re: [EXT] parsing TSV (or CSV) file with scilab is a nightmare

Antoine,

 

One workflow that works fast for me with large data files is to first load the whole file with mgetl, then remove all empty lines using isempty in a loop (as shown below), process the header block, isolate the data block and save it to a temporary file on disk with mputl, and finally load that temporary file very efficiently with fscanfMat.

 

// Define the helper first so that it exists when the script calls it
function out_text=cellfun(fun, in_text)
// Applies function to input text (column strings vector), line by line
  n=size(in_text,1);
  for i=1:n
     out_text(i)=fun(in_text(i));
  end
endfunction

// fid: descriptor of a file already opened with mopen
tlines= mgetl(fid,-1);  // reads lines until end of file into 1 column text vector
bool= ~cellfun(isempty,tlines);
tlines= tlines(bool);    // removes empty lines
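
For illustration only, the whole workflow described above (mgetl, drop empty lines, isolate the data block, mputl, fscanfMat) might look like the sketch below. The sample file, the header and footer sizes (nheader, nfooter) and the temporary file names are assumptions made for this example, not part of the original message, and the empty lines are dropped with a vectorised length() test rather than the loop.

```scilab
// Hypothetical sample file standing in for a real export
src = fullfile(TMPDIR, "sample.tsv");
tab = ascii(9);
mputl(["Header line"; ""; "1" + tab + "2.5"; "3" + tab + "4.5"; ""; "Footer line"], src);

tlines = mgetl(src);                      // whole file, one string per line
tlines = tlines(length(tlines) > 0);      // drop empty lines

// Assumed layout once empty lines are gone: 1 header line, 1 footer line
nheader = 1;
nfooter = 1;
header    = tlines(1:nheader);
footer    = tlines($-nfooter+1:$);
datalines = tlines(nheader+1:$-nfooter);  // numeric block only

// Save the numeric block to a temporary file and reload it as a matrix
tmp = fullfile(TMPDIR, "data_only.txt");
mputl(datalines, tmp);
data = fscanfMat(tmp);
```

The header and footer stay available as string vectors, so nothing from the original file is lost while the numeric block is parsed by fscanfMat.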

 

 

Regards,

Rafael


JLan

Re: [EXT] parsing TSV (or CSV) file with scilab

I find it safer to process the data without returning to a disk file. As mentioned, I actually prefer to start with mgeti() and read the file as binary, as then all byte values are accepted.

But anyway, once the data is separated into lines, it is relatively simple to split each line on the wanted separators and convert it using the wanted decimal sign:

dataset=[];
headerlines=3;
footerlines=2;
n=size(in_text,1);   // total number of lines in the file
for k=1:n
    if k>headerlines && k<=n-footerlines then
       // split on tab or ";" and convert using "," as decimal sign
       datatemp=strtod(strsplit(in_text(k),[ascii(9),";"]),",");
       dataset(k-headerlines,1:length(datatemp))=datatemp';
    end
end

disp(in_text(1:headerlines));
disp(dataset);
disp(in_text(($-footerlines+1):$));


