Mystery character from dot doc

A current project involves getting documents from someone in .doc format, from which I just cut and paste into my text editor for web markup. I just got a surprise when I copied a short list and only half of it "pasted". It seems that MS Word has (at least) two dash- or hyphen-like characters, and one of them translates strangely. In this case it was the longer one, presumably intended to be a dash, that caused the problem. Here is a sample of the first lines.

Code:
AMALAGAMATED DOUKHOBOR CHOIR– traditional Doukhobor singing
GRAHAM BALDWIN – songs of life’s seasons & struggles 
JON BARTLETT & RIKA RUEBSAAT– BC & Canadian songs
ROBERT BERTRAND – country blues with guitar& harmonica

I see that paste works just fine in this forum editor. How come it causes a "delete to end of line" effect for me?
 
No idea. But now you may be able to copy from the forum into your project. :D
 
It would be an interesting use of a forum indeed. :) However, pasting from the forum gives the same result. Nevertheless, I've already used a similar technique: pasting into the ee editor, saving, and then pasting from there. There's always a way.

BTW, in ee the character shows up as
Code:
~@~S
and when copied from there it shows up as a colon.
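
Piping the pasted text through hexdump shows what is actually there; in my case the mystery character comes out as the three bytes e2 80 93 (written as octal escapes below), which would also explain ee showing ~@ and ~S for the last two:
Code:
$ printf 'CHOIR\342\200\223 traditional\n' | hexdump -C
00000000  43 48 4f 49 52 e2 80 93  20 74 72 61 64 69 74 69  |CHOIR... traditi|
00000010  6f 6e 61 6c 0a                                     |onal.|
00000015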
 
That's an idea. The .doc files are binary, so I'm not sure it would work, but I'll have a look.
 
Thanks Beastie. I just tried textproc/antiword and that's definitely the best solution for dealing with Word documents. What a treat!
 
Although it seems textproc/antiword solved your problem, you might run into another obstacle, namely the character encoding of the web page.

The dashes which you mentioned in your first post are so-called en dashes, and these are sometimes used as the poor man's non-breaking dashes in word processors. The width of an en dash is supposed to be the same as that of the letter n, hence the name. In the Windows-Latin-1 encoding, the character code of the en dash is 0x96; in the modern UTF-8 encoding, which is nowadays used by almost everything, it becomes the byte sequence 0xE2 0x80 0x93.
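
A quick way to verify that mapping, assuming iconv is at hand (the encoding name may be spelled CP1252 or WINDOWS-1252, depending on the implementation):
Code:
$ printf '\226' | iconv -f CP1252 -t UTF-8 | hexdump -C
00000000  e2 80 93                                           |...|
00000003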

Since antiword uses the current locale of your machine for character mapping, you want to check whether your locale matches the character encoding of the web site for which you convert the content. Otherwise you might run into encoding problems once again.
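
By the way, antiword can also be given the mapping explicitly with its -m option, independent of the locale; mapping files such as UTF-8.txt and 8859-1.txt come with the port. With performers.doc standing in for whatever file you convert, that would be something like:
Code:
$ antiword -m UTF-8.txt performers.doc > performers.txt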

I would go straight with UTF-8; however, in case we are talking about plain English text, ISO-8859-1 would serve as well.

For example, in my /etc/csh.login I have:
Code:
...
setenv LANG en_US.UTF-8
setenv MM_CHARSET UTF-8

In this case I would place in the HTML head:
Code:
<!DOCTYPE html><HTML><HEAD>
   <META http-equiv="content-type" content="text/html; charset=UTF-8">
   ...

In any case, the character encodings must match. If you like, you might replace any occurrence of UTF-8 with ISO-8859-1, but don't mix 'em up.

PS: Please forget ISO-8859-1; only now do I see that it does not contain the en dash. To avoid any encoding problems, it would be best to stay with UTF-8.
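
Should you ever really need ISO-8859-1 output anyway, iconv can at least transliterate the en dash; with GNU libiconv the //TRANSLIT suffix turns it into a plain hyphen-minus, while other iconv implementations may substitute a question mark instead:
Code:
$ printf '\342\200\223\n' | iconv -f UTF-8 -t ISO-8859-1//TRANSLIT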
 
Thanks obsigna. That's the kind of detail I was most interested in. :)

I write all my web sites by hand anyway so it's plain text and always UTF-8. I have never had any problems before.

I just now used the file command and see that the dot doc files I'm getting in this case have "Code page: 1252", which I read is also referred to as Windows-1252. I guess that's the issue here: it doesn't translate properly to my UTF-8 environment (my text editor, web pages, and computer default).
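
For reference, the check was nothing more than this (file name changed, output trimmed to the interesting field):
Code:
$ file performers.doc
performers.doc: Composite Document File V2 Document, ..., Code page: 1252, ...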

Anyway, being an old curmudgeon, I generally blame Microsoft for anything that goes wrong, including lack of parking spaces when shopping. :)
 
Disclaimer: I know that the topic "editors" is a minefield. And the following is not actually a suggestion, but only an example. There are people who call everybody who is not using vi or ed a quiche eater, so for the time being let's take that as granted.

Anyway, the simple text editors (ed, ee, nano, vi(?), etc.) simply assume that the opened text file has the same character encoding as indicated by the locale settings. This works fine as long as you work on text generated on your own system or on similar systems, or as long as you use only the characters of the 7-bit ASCII encoding. It may fail miserably when it comes to text from other kinds of sources. Here, Windows Latin 1 = Windows-1252.

Usually I do text editing on FreeBSD in the terminal only, using nano; more complex text I edit on macOS, where I don't know of any editor which is not aware of text encoding.

I have set up one FreeBSD desktop system for testing purposes, running GNOME 3. Unfortunately, editors/gedit belongs to the simpler editors mentioned above, so forget it for the given purpose. A quick Google search revealed somebody suggesting editors/kate as an encoding-aware editor, but this would drag in the whole KDE hell of dependencies. For this reason, I didn't even consider trying that one.

Another suggestion was www/bluefish:
Bluefish is a powerful editor targeted towards programmers and web
developers, with many options to write websites, scripts and programming
code. Bluefish supports many programming and markup languages and has
many features, but is still a very fast and lightweight application.

I installed it right from the binary package repository, and it dragged in weblint as the only dependency besides what was already installed by GNOME 3. I only needed to add Windows-1252 as a known encoding, and the editor opened the file correctly and in addition showed the correct encoding in the status bar (see screenshot). Needless to say, you would also be able to save the file in a different encoding, e.g. UTF-8.
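
For the record, installing it from the package repository was just a matter of:
Code:
# pkg install bluefish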

So again, I don't actually propose using the Bluefish editor; however, for your next steps it might be useful to evaluate some editors which come with functions that facilitate the conversion of Word documents to a web page.

[Screenshot: Bluefish Editor]
 

No worries. You can call me anything you like. :) I certainly appreciate your suggestions. Indeed, there are editors that would automatically solve my (now solved) problem.

Actually, the real problem was the mystery of the source encoding, and that's solved in spades with file and textproc/antiword. I'll use antiword for translating MS Word documents from now on, though I do have to hold my nose when people send those. ;) Not being a professional (and having no intention of becoming one), I generally just ignore all things from Microsoft, except when someone with no computer knowledge sends me such a file; then I consider it my responsibility to be the one who can cope.

Yeah, bluefish has a great reputation, and my wife likes it for the web. She also swears by kate and makes great use of its tabbing system. Actually, I run KDE, so kate is right here and I can vouch for its encoding-aware abilities.

Like you, I need one that runs in a terminal though, so kate never really works for me, despite its rich features. I've been very happy with editors/ne, especially since the keystrokes are natural to me now that I've been using it for the last ten years. (Actually the key bindings are natural for anybody who started with the first DOS systems.) I'm not going to change that now. :) Besides I never did like an editor that shows anything of its own on the screen other than a cursor.

My web work is pretty basic, and I really prefer to work with just plain text written directly from my mind without any help from a program. I consider that the right way for artistic expression, though I see most people prefer to use things like bluefish and templates. However, they have different aims than I do.
 