C/C++ Convert Manual Text to HTML format?

OP
OP
Spartrekus

Spartrekus

Daemon

Reaction score: 153
Messages: 1,150

pandoc is the best tool to convert text documents

pandoc install

Bash:
# pkg install pandoc
pandoc is written in Haskell and does require about 1.5 GB of space to install
but it's worth it
My impression is that what those 1.5 GB do could be done in fewer than 50-100 lines of plain C.
 

NapoleonWils0n

Active Member

Reaction score: 49
Messages: 151

My impression is that what those 1.5 GB do could be done in fewer than 50-100 lines of plain C.
On FreeBSD pandoc installs a lot of Haskell stuff which takes up a lot of space,
whereas on Linux there is a pandoc binary which is much, much smaller
 

ralphbsz

Daemon

Reaction score: 1,167
Messages: 1,879

I want to see you parse the file format, recognize the structure (headings, lists, ...) in 50 lines of C. That language is a very bad choice for this job. If you are interested in doing it as a teaching exercise, why don't you use assembly instead? It would be even more difficult.
 
Spartrekus

I did consider that; why not?
I want to see you parse the file format, recognize the structure (headings, lists, ...) in 50 lines of C. That language is a very bad choice for this job. If you are interested in doing it as a teaching exercise, why don't you use assembly instead? It would be even more difficult.
C is likely the best choice, because you can run it on anything.

ex, or even ed, was written in C.
 

ralphbsz


Have you ever tried to run C on an 8-bit micro, with less than 64K address space, and no FPU and no 32-bit variables and only limited support for 16-bit arithmetic? I have (on a Z80). The notion that you can run C anywhere is mostly false, but contains a small grain of truth: the basic syntax of C will work on many computers, from 8-bit micros to the largest supercomputers. But it is very difficult to write a program that survives porting outside the comfort zone of 32- and 64-bit machines of today. To begin with, the standard C libraries that we use today require so much memory that they simply do not run on everything. Even today, writing C code that is correct on both 32- and 64-bit and both big- and little-endian systems is quite difficult, and requires great care. For example, if you write your code for this problem on an i386, I'm sure it will break on a Power9, unless you have training and experience in writing portable code.

And in practice: Can you give me an example of a computer that runs modern C (not a highly restricted subset for embedded systems), but is not capable of running python or perl?

The really big problem with your idea is this. The task you are proposing is to parse a text input. The parsing consists of finding large-scale structure, such as headings, subheadings, indented lists, description lists, and so on. Then turn that structure back into a markup language, such as HTML. This is a form of text processing. Pure C is remarkably bad at text processing. Why? The single biggest impediment is that C has no memory management built in. Every time you work with a string, you (as the programmer) have to take care of allocating memory for the string, knowing ahead of time how much memory needs to be allocated, keeping track of whether the memory is still needed, and releasing it afterwards. In a typical string-handling C program half the code and 90% of the bugs are about memory management. This is a terribly inefficient way to solve this problem; inefficient both at runtime (all the code that humans write to keep track of which memory is needed when is probably not as good as what a well-designed programming language could do), and inefficient in programmer time, in particular when you consider bugs in the code.

You want to do string handling? Use an appropriate language. One where the string is a first-class data type (and not just syntactic sugar around an array of bytes, which by a strange coincidence is terminated by an in-band character). One where memory management is automatic and efficient. One where data structures (such as lists, trees, dictionaries, hashmaps, ...) are immediately available, in efficient and bug-free versions.
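
To make the bookkeeping described above concrete, here is a minimal sketch in C of joining two strings. The function name `concat` is made up for this illustration and does not come from the thread. It only shows the steps the programmer has to do by hand: work out the size, allocate, check for failure, copy, and later free.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Join two strings into a newly allocated buffer.
 * Returns NULL if allocation fails; the caller must free() the result.
 * (concat is a hypothetical helper, named only for this example.) */
char *concat(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    size_t la = strlen(a);
    size_t lb = strlen(b);

    /* The size must be known up front: both lengths plus the NUL. */
    char *out = malloc(la + lb + 1);
    if (out == NULL)
        return NULL;

    memcpy(out, a, la);
    memcpy(out + la, b, lb + 1);   /* also copies b's terminating NUL */
    return out;
}
```

A caller would write something like `char *s = concat("<h1>", "Title");` and must remember `free(s)` afterwards. In a language with built-in strings the same operation is typically a single expression with no cleanup step.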
 
Spartrekus

That is indeed how it is with C.
It takes long, hard work to solve a problem without a segfault.

However, it does give faster code, which in every case takes fewer resources than Perl, Python, ... Ruby.
 

ralphbsz


Have you measured the "less resource" thing? Can you offer some data as proof? Don't tell me "everyone knows", because those statements tend to be nonsense.

I did. This was in the late 90s, when Java was all the craze. Our development group wanted to use Java for the next major project (it ended up several million LOC), and engineering management said that "Java is too slow, we'll burn too much CPU time". So our group took a sample piece of code (a pretty complex image correlation, which had been hand-optimized in C++), and measured it with 4 different runtime environments: MS Visual C++ (which we thought would be slow, since "everyone knows" that Microsoft's compiler makes bloated code), Waterloo C++ (since "everyone knows" that the Waterloo and Portland compilers have the best code generators), the Symantec Java JIT (in the hope that just-in-time compilation would help Java look a tiny bit better, but we expected this one to come in last), and an experimental Java-to-native-instructions compiler (bypassing the JVM!) which we had received from a research group. The results were totally unexpected: they were exactly backwards. The fastest was Java with the JIT, then came Microsoft, then came the Waterloo C++ compiler, and the experimental Java compiler was slowest.

So before you tell me that something uses less resource, please show me the data.

Also, go back to your CS101 class. There you learned that the constant part in front of an algorithm's runtime is of minor importance. The important part is the order or exponent of the algorithm. If using a better runtime environment allows you to implement a better solution, which has less combinatorial complexity, that will always dwarf the factor: even if python did indeed run 50% slower than C++ (which is something I don't even believe), using a data structure that's available in python and that allows you to reduce your algorithm from O(n^3) to O(n^2) will save you a factor of 1000 for a 1000-line input file. And given that the solution to finding large-scale structure in a text file will have to look at multiple lines at a time, the runtime of your algorithm is likely to be way higher than O(n), and such algorithmic improvements are where the real art of programming lies.
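
The arithmetic behind that factor can be checked with a few lines of C. This is only a sketch with abstract "steps", not a measurement; the function names `steps` and `speedup` are invented for the illustration, and the 1.5 constant stands in for the hypothetical "50% slower" runtime.

```c
#include <assert.h>

/* Abstract step count for an algorithm of order n^k, scaled by a
 * constant factor. Purely illustrative: real runtimes are messier. */
double steps(double constant, unsigned n, unsigned k)
{
    double s = constant;
    for (unsigned i = 0; i < k; i++)
        s *= n;
    return s;
}

/* How many times fewer steps the O(n^2) version needs than the
 * O(n^3) version, when the O(n^2) version pays slow_constant per step. */
double speedup(unsigned n, double slow_constant)
{
    return steps(1.0, n, 3) / steps(slow_constant, n, 2);
}
```

For n = 1000, `speedup(1000, 1.0)` is 1000, and even with the 1.5 penalty `speedup(1000, 1.5)` is still about 667, so the constant factor barely dents the gain from the better order.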

Furthermore, for a program like this, the actual runtime is a tiny part of the overall resource usage. Once you have this thing developed, it will run in a fraction of a second for input files of reasonable length. Unless this piece of code becomes a central part of the processing that Facebook uses, or that the NSA uses, or it is part of something that does things like weather forecasting on supercomputers, the resource consumption at runtime is perhaps CPU-seconds, CPU-minutes if you have many files to convert. But to develop it, you will take weeks. During that, you will be editing, compiling, linking, and testing. That will use CPU-weeks. You are focused completely on the wrong thing, even if you want to restrict your analysis to purely energy consumption (CO2 output) of the project. The real metric you should use is: Your time has value (you could be doing something useful to save the planet in the extra weeks), and with a good development environment, you can get done faster, write a more efficient program, and come up with something of higher quality.
 
Spartrekus

CPU-seconds and CPU-minutes are actually a good concept.

It depends on whether the developer starts with an already available set of functions for string operations or not. If so, it can eventually be done faster than in another high-level language.
 

ralphbsz


Suggestion: Why don't you write here a pseudocode skeleton of the program you want to write? Here is a starting point:
  • Read all lines of the input file into memory.
  • Assume that all input is US-ASCII, with line length <80 characters, no embedded NUL or other control characters, and all lines terminated by NL only.
  • Blank lines have zero length, only the trailing NL.
  • Break the input into paragraphs.
  • Assume that the first line of the file is a single-line paragraph, and use it as the title and <H1> field.
  • Recognize header lines as a 2-line paragraph, both lines the same length, second line all dashes. Make those into <H2>
  • ... deal with lists, descriptions, and so on.
If you do that, then we can start poking holes at it, and show you the real complexity of the task you are about to try. For example, each of the above things I wrote needs 5 or 10 extra sentences to explain what really happens.

Once you go a little bit deeper into it, you will find that this is actually very complicated.
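
As a starting point for poking holes, the first few bullets of the skeleton above could be sketched in C roughly as follows. This is not a complete converter: it takes the stated assumptions at face value (US-ASCII, lines under 80 characters, NL-terminated), additionally assumes the text contains no `&`, `<` or `>`, and ignores lists and descriptions entirely. The names `text_to_html`, `MAXLINES` and `MAXLEN` are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAXLINES 1024   /* hypothetical limit, not from the thread */
#define MAXLEN   82     /* <80 characters + NL + NUL */

static char lines[MAXLINES][MAXLEN];

/* True if s is non-empty and consists only of '-' characters. */
static int all_dashes(const char *s)
{
    return *s != '\0' && strspn(s, "-") == strlen(s);
}

/* Emit the paragraph made of lines[first] .. lines[last - 1]. */
static void emit_paragraph(FILE *out, size_t first, size_t last)
{
    size_t n = last - first;

    if (first == 0 && n == 1) {
        /* Single-line paragraph on the first line of the file: title. */
        fprintf(out, "<h1>%s</h1>\n", lines[first]);
    } else if (n == 2 && all_dashes(lines[first + 1]) &&
               strlen(lines[first]) == strlen(lines[first + 1])) {
        /* Two lines of equal length, second all dashes: header. */
        fprintf(out, "<h2>%s</h2>\n", lines[first]);
    } else {
        fputs("<p>", out);
        for (size_t i = first; i < last; i++)
            fprintf(out, "%s%s", i > first ? "\n" : "", lines[i]);
        fputs("</p>\n", out);
    }
}

/* Read all of `in`, split it into paragraphs at blank lines,
 * and write the HTML fragments to `out`. */
void text_to_html(FILE *in, FILE *out)
{
    assert(in != NULL && out != NULL);
    size_t count = 0;

    while (count < MAXLINES && fgets(lines[count], MAXLEN, in) != NULL) {
        lines[count][strcspn(lines[count], "\n")] = '\0'; /* strip NL */
        count++;
    }

    size_t start = 0;
    for (size_t i = 0; i <= count; i++) {
        int blank = (i == count) || lines[i][0] == '\0';
        if (!blank)
            continue;
        if (i > start)
            emit_paragraph(out, start, i);
        start = i + 1;
    }
}
```

Called as `text_to_html(stdin, stdout);` from a `main`, this already runs to about fifty lines while silently truncating long files, splitting over-long lines, and treating everything that is not a title or a dashed header as a plain paragraph.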
 
Spartrekus

Yeah, likely it will be available.

Java for simple tasks does indeed seem, as above, "slow", or locked to another piece of software.
You need to deal with installing Java on MS, Apple, ...
With C, you can readily run it on any BSD and any Linux (maybe except Ubuntu), without changing the source code. It can run any time you need it, on a bare-bones clean installation. Fast and reliable. On some BSDs, clang can be easily installed. So it is easy to compile anything, anywhere. This can even be done over a remote SSH session, without a desktop. It is actually a sort of freedom.
 

Bobi B.

Well-Known Member

Reaction score: 142
Messages: 345

Any particular reason for writing yet another tool to repeat something others have already done?

And how often do you plan to use this tool? Are existing tools really so bad that you need a new one?
 
Spartrekus

I bet it's already written. Must be this one.
This was experimental, a very first attempt, brrr; this project could actually be removed from git. There is room for many improvements, about 10 million ;)

The plan is to use it only once, as an experiment and for the learning curve.
Learning C is good. The aim is focused only on learning C. Why? Because it is a nice programming language.

But yeah, there are cool examples on Unix. In the past, D. Ritchie used C for many things, when Perl, Python, Java, Ruby, ... did not exist yet.

Ex. (1): https://www.tuhs.org/cgi-bin/utree.pl?file=V5/usr/source/s2/wc.c

Ex. (2) An example of editor processing strings: https://minnie.tuhs.org/cgi-bin/utree.pl?file=V6/usr/source/s1/ed.c
which does not have many lines of code.
I would be interested to see how many lines ed would take in Java.
 
Spartrekus

+ Programming in C without a graphical interface allows complete freedom, which leaves the chance to work over an SSH connection in a terminal.
Alternatively, X11 + xterm is possible as well.
Emacs, notepad, geany, notepad++ wine, ... geany, vim with/without x-terminal, mcedit, ee, ed, gvim, leafpad, ... kate, ... or any editor you would like.
Finally, it also makes it possible to use the application with an additional frontend, if desired (FLTK, GTK, ...).

Basically, it offers the chance of free software and independence, and the software is then in your hands. It is not like using MS Windows, where no one gets to decide how the software will be used.

It is much faster to code something within a few minutes, 5 to 10 minutes at most, than to look for hours for someone else's solution, which will not be exactly what you are looking for. It is really very fast.

Because clang is available by default on BSD, your program can run easily without the large, CPU-consuming installation of numerous (e.g. graphical) libraries and packages.

If you have versatile, ready-to-use libraries that you like, it is ultra fast to get something done - in a human-readable markup - language.

Note well: here, freedom does not necessarily mean GNU; it is not related to GNU. GNU has some drawbacks, somehow.
 

hruodr

Well-Known Member

Reaction score: 46
Messages: 362

which method or tool would you recommend
I would write a C program that adds at the beginning of the text:

<html><head></head><body><pre>

and at the end

</pre></body></html>

Would that not be "regular html"?
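
A minimal sketch of that wrapper in C might look like the following. The function name `wrap_in_pre` is invented for this example. The escaping of `&`, `<` and `>` is an addition that the post does not mention; it is included on the assumption that the manual text may contain those characters, which are still treated as markup inside `<pre>`.

```c
#include <assert.h>
#include <stdio.h>

/* Copy `in` to `out` inside a minimal HTML page, using <pre> so that
 * the original line breaks and indentation are kept as they are.
 * (wrap_in_pre is a hypothetical name for this illustration.) */
void wrap_in_pre(FILE *in, FILE *out)
{
    assert(in != NULL && out != NULL);
    int c;

    fputs("<html><head></head><body><pre>\n", out);
    while ((c = getc(in)) != EOF) {
        switch (c) {
        /* Escaping is an addition, not part of the original suggestion. */
        case '&': fputs("&amp;", out); break;
        case '<': fputs("&lt;", out);  break;
        case '>': fputs("&gt;", out);  break;
        default:  putc(c, out);        break;
        }
    }
    fputs("</pre></body></html>\n", out);
}
```

Called as `wrap_in_pre(stdin, stdout);` from a `main`, this stays well under 50 lines, precisely because it recognizes no structure at all.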

ralphbsz, perhaps you manage to do it in less than 50-100 lines in C.

If you mean, you must pass some of the original structure of the text to the output,
then you must begin recognizing that structure and then find an appropriate tool
for it, perhaps lex and yacc.
 
Spartrekus

Thank you for your post. I appreciate.

I believe I did this C exercise about 12-18 years ago.
The teachers gave us basic examples to start with. I started to work on this exercise.

Code:
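    #include <stdio.h>

    /* Copy standard input to standard output, one character at a time. */
    int main(void)
    {
        int c;    /* int rather than char, so EOF can be told apart from data */

        c = getchar();
        while (c != EOF)
        {
            putchar(c);
            c = getchar();
        }
        return 0;
    }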
I will keep you up to date on progress.

Basically, the top line of HTML (version, ...) is not necessary, only for some specific uses.
The body, head, ... can be left out. They make the HTML page look nicer.
elinks --force-html can handle it if this first top line is not present.
 

ralphbsz


+ programming in C without graphical interface allows to have a complete freedom,
What do you mean by complete freedom? That you can write a program that can do anything you want, as long as it has text input and output? Well, you can do that in nearly all programming languages. At least the ones that are colloquially (but wrongly) called "turing complete". A good computer science theorist (which I am not) can probably create a formal proof that any algorithm that converts text into text and is written in C and that terminates can also be implemented in ruby, rexx, perl, python, or awk.

... which leaves chance to work in a SSH connection with terminal.
Alternatively X11 + xterm is as well possible.
Which has exactly nothing to do with the choice of programming language. You can write command-line programs in a huge variety of languages. For this particular problem, C is one of the least suitable languages (although Fortran-IV or assembly would be admittedly even less suitable).

Finally, it gives possibility too to use the application with an additional frontend, if desired (FLTK, GTK,...).
Absolutely. That is true of any program that is a pure command-line program. Even for perl and python programs, which you seem to so dislike.

Basically, it offers chance to have free, software independence, and the software is then in your hands. It is not like using MS Windows, where no one has chance to decide how will be used the software.
Sorry, but the question here (how to implement a text -> text converter that reads a certain documentation format and outputs HTML format) has absolutely nothing to do with free. You can do it with free software, or with licensed paid software. You can create free software, or you can assign the rights to the software to someone else, or you can license and sell it yourself. You can do all of this on MS Windows (yes, there are free C compilers for Windows), and on Unixes. Sorry, but your raving about Windows and freedom has NOTHING to do with the question here.

It is much faster to code within few minutes, 5 to 10 min maximum something rather than looking for hours for a solution of someone, which will not be ideally what you are looking for. It is really very fast.
I want to see you code this in 5-10 minutes. Really. It would take a seasoned professional hours to come up with a quick solution that works correctly for a few files, and weeks or months to come up with a general solution that is high enough quality to release it to a wide audience. I am sorry, but you have completely lost touch with reality.

Because clang is by default available on BSD, then, ...
Many other languages are also available on BSD. For example awk. This problem could be coded in awk (and it wouldn't even be a particularly bad idea, way better than C). And awk is most certainly in the default installation. To be honest, I don't know whether python and perl are in the default installation (I install them anyway, since they are needed for any computer that can do realistic work).

your program can run easily without need of large, CPU consuming, installation of numerous (eg. graphical) libraries and packages.
That's complete poppycock. Installing programming languages such as the ones I keep mentioning is
  • not large (the packages are fast to download and install, and only increase the OS footprint by a very small amount),
  • it is not CPU consuming (you keep spewing that nonsense, and when called on the carpet to produce some evidence you go strangely silent),
  • not numerous (the base perl or python package is probably just one or a handful of packages),
  • and has nothing to do with graphical. All these programming languages exist in basic, non-graphical versions.
If you have versatile, ready-to-use libraries, that you like, it is ultra fast to get something done
Absolutely. If you can find a ready-to-use library that can parse the documentation format you showed above, then the problem you posed here is nearly solved. Alas, I don't know of such a library. You will have to write it yourself. Which is what I'm trying to get you to start, or at least explain how you would start.

- in an human readable markup - language.
Well-written code in nearly any programming language is human readable. Even in perl (although admittedly, the standard perl style is a little hard to follow). I know there are exceptions, languages where code is nearly unreadable, some that were used for serious work (such as APL), and some that were jokes (such as Intercal).

Note, well, here, freedom does not mean GNU necessarily, but it is not related to GNU. GNU has some drawbacks, somehow.
None of this has anything to do with GNU, or freedom. You posed a programming question. Deal with it and start programming. I've given quite a few pointers in this thread, about how to select programming languages, how to think about structuring your program, and so on. Please stop your lunatic raving about Windows, GNU, freedom, and simplicity, and start using a computer.
 
Spartrekus

What do you mean by complete freedom? That you can write a program that can do anything you want, as long as it has text input and output? Well, you can do that in nearly all programming languages. At least the ones that are colloquially (but wrongly) called "turing complete". A good computer science theorist (which I am not) can probably create a formal proof that any algorithm that converts text into text and is written in C and that terminates can also be implemented in ruby, rexx, perl, python, or awk.

Surely, we now have freedom and a wide range of possibilities in the choice of programming language. Why not assembler, by the way, to train one's skills? Or Java, going the other way.
Both are good choices.


Which has exactly nothing to do with the choice of programming language. You can write command-line programs in a huge variety of languages. For this particular problem, C is one of the least suitable languages (although Fortran-IV or assembly would be admittedly even less suitable).

Sure. Because C is really pared down to the minimum.
However, it is a cool, good one.
You can do it in Pascal; that is good training too.


Absolutely. That is true of any program that is a pure command-line program. Even for perl and python programs, which you seem to so dislike.
I want the program to compile without worries and to be reliable, about 200 percent.
I chose C because it works on PC Windows, Mac, Linux, ... and BSD with almost no modification.
It can be compiled 10 years later ;) Cool.


Sorry, but the question here (how to implement a text -> text converter that reads a certain documentation format and outputs HTML format) has absolutely nothing to do with free. You can do it with free software, or with licensed paid software. You can create free software, or you can assign the rights to the software to someone else, or you can license and sell it yourself. You can do all of this on MS Windows (yes, there are free C compilers for Windows), and on Unixes. Sorry, but your raving about Windows and freedom has NOTHING to do with the question here.
OK, I was thinking about Microsoft products that change from time to time. Users have to use next-generation formats, or buy a new release so that it works with Win 10. You probably understand (although there is "Learn how to get older programs to run on this version Windows").


I want to see you code this in 5-10 minutes. Really. It would take a seasoned professional hours to come up with a quick solution that works correctly for a few files, and weeks or months to come up with a general solution that is high enough quality to release it to a wide audience. I am sorry, but you have completely lost touch with reality.
It depends on the need: professional (for many people) or simple programming for one's own use.
You can make it a big project and spend years on it, or spend just minutes. It is up to what the conversion should do. Basically, it can be complex, since (X)HTML is really a big, complex markup language (today).



Many other languages are also available on BSD. For example awk. This problem could be coded in awk (and it wouldn't even be a particularly bad idea, way better than C). And awk is most certainly in the default installation. To be honest, I don't know whether python and perl are in the default installation (I install them anyway, since they are needed for any computer that can do realistic work).
awk is cool - yeah.
C too ;)



That's complete poppycock. Installing programming languages such as the ones I keep mentioning is
  • not large (the packages are fast to download and install, and only increase the OS footprint by a very small amount),
  • it is not CPU consuming (you keep spewing that nonsense, and when called on the carpet to produce some evidence you go strangely silent),
  • not numerous (the base perl or python package is probably just one or a handful of packages),
  • and has nothing to do with graphical. All these programming languages exist in basic, non-graphical versions.
I prefer C over python, because I prefer C / the syntax is nice.
If you want to learn C, just do C ;)


Absolutely. If you can find a ready-to-use library that can parse the documentation format you showed above, then the problem you posed here is nearly solved. Alas, I don't know of such a library. You will have to write it yourself. Which is what I'm trying to get you to start, or at least explain how you would start.
There are many readily available libraries. I have used them already, and it worked.
When I was a student, I did quite a lot of programming in Pascal: the best programming language was Pascal. I was fascinated by Pascal. I did not have much chance to learn C, because we were obliged to learn all the Microsoft products (Visual Basic 4, 5, 6, .NET, ...) and so on ... brr.
I think that learning (Free) Pascal, Delphi, ... would have been much better.



Well-written code in nearly any programming language is human readable. Even in perl (although admittedly, the standard perl style is a little hard to follow). I know there are exceptions, some that were used for serious work (such as APL), and some that were jokes (such as Intercal).
They are all human readable markup languages, they are fine.
Perl, Ruby, ... many people use them for their own reasons.
Perl is readily and highly powerful for such things, really.
Anyone should be using Perl for such a problem (txt <-> html).


None of this has anything to do with GNU, or freedom. You posed a programming question. Deal with it and start programming. I've given quite a few pointers in this thread, about how to select programming languages, how to think about structuring your program, and so on. Please stop your lunatic raving about Windows, GNU, freedom, and simplicity, and start using a computer.
C is my choice, because (1) I like the syntax, (2) it works, and (3) it is available anywhere, anytime!
It compiles and is portable across Linux, BSD and Windows.


The C programming language: it is stable and reliable.

Here, get this and compile without needing the admin password: https://bellard.org/tcc/
Portable.
 

hruodr


Spartrekus, two lines with less than 80 chars are enough for doing the same:

Code:
#include <stdio.h>
int main() {int c; while((c=getchar())!=EOF) putchar(c); return 0;}
That is very practical if you are punching cards. I find such compact code more readable.

I like C very much, but for text transformation, higher level tools are better. lex and yacc are used to generate C code: these programs were superfluous if C were enough for an easy lexical analysis and parsing.

Of course you can write the same code manually: it is just a "little" more complicated. Many compilers are written that way, but a compiler is written once and used a lot without needing modification, so one has the time to write it. It is something different to write a program that solves a task, perhaps only once.

A scripting language can be very practical for text transformation. Many of the routines you need are programmed in it. I like tcl, but everyone has his preferences. There are other practical tools, as mentioned lex and yacc, you can call ed inside a script, awk and sed.

But as said, before you begin writing a program that does something, you must recognize what this something is. If you do not recognize any structure in the text other than its being text, then the concatenation of the two lines is the right program. If you recognize only headers, then you can easily write a program that treats the headers and the rest. Other structure may be more complicated, but what is it? Can you conclude what it is from only seeing the file you quoted? How should it be treated by your program? If you are, for example, using lex and yacc, you must try to express the structure with tokens and grammar rules, and see what actions generate the HTML.
 
Spartrekus

Spartrekus, two lines with less than 80 chars are enough for doing the same:

Code:
#include <stdio.h>
int main() {int c; while((c=getchar())!=EOF) putchar(c); return 0;}
That is very practical if you are punching cards. I find such compact code more readable.

I like C very much, but for text transformation, higher level tools are better. lex and yacc are used to generate C code: these programs were superfluous if C were enough for an easy lexical analysis and parsing.
I am a bit surprised that you are against superfluous methods. Usually, a high-level programming language is THE solution, as seen above.

I haven't learned lex. It looks nice and OK to use.

Depending on needs :
Code:
#include <stdio.h>
int main() {int c; while((c=getchar())!=EOF)    if ( c == '\n' ) ....   else   putchar(c);    return 0;}
For instance "<br>" could be added or it could detect an empty line and add "<br>".
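
The "...." in the one-liner above is left open in the post. One possible reading of the first option, writing "<br>" at every newline, could be sketched as follows. The function name `nl_to_br` is invented for this example, and keeping the newline after the tag is an assumption made so the HTML source stays readable.

```c
#include <assert.h>
#include <stdio.h>

/* Copy `in` to `out`, writing "<br>" in front of every newline.
 * (nl_to_br is a hypothetical name; the post leaves this part open.) */
void nl_to_br(FILE *in, FILE *out)
{
    assert(in != NULL && out != NULL);
    int c;

    while ((c = getc(in)) != EOF) {
        if (c == '\n')
            fputs("<br>", out);   /* tag first, then the newline itself */
        putc(c, out);
    }
}
```

Called as `nl_to_br(stdin, stdout);` from a `main`, an empty input line then simply comes out as a line holding only `<br>`. Detecting empty lines separately, the second option mentioned, would need the filter to remember the previous character.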
 