[lfs-support] differences between en_US, en_US.iso88591, and en_US.utf8
Michael Havens
2014-09-01 02:11:09 UTC
Hola! I am in section 7.13 and am now attempting to figure out my locale. I
live in the United States and speak English... well, not British English,
but anyway! So I ran 'locale -a' and got a list of locales. I am told the first
two letters represent the language and the second two letters represent the
country, but what about the characters after that? In my particular case I
would choose en_US, en_US.iso88591, or en_US.utf8. If I remember correctly
from what I've seen, I should select en_US.iso88591, but I am not sure. I
would also like to know what the differences are between the three and why
I should select one over the other... if that is not too much trouble.

OK, after a little more looking I found this:

The only difference between en_US and en_US.utf8 is that the former uses
ISO-8859-1 for a character set, while the latter uses UTF-8. *Prefer UTF-8.*
The only difference in these is in what characters they are capable of
representing. ISO-8859-1 represents characters common to many Americans
(the English alphabet, plus a few letters with accents), whereas UTF-8
encodes all of Unicode, and thus, just about any language you can think of.
UTF-8, today, is a de facto standard encoding for text. (Which is why you
should prefer it.)
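
To make the difference concrete, here is a minimal shell sketch (not part of the quoted answer) that prints the bytes used for "é" (U+00E9) in each encoding. The helper names latin1_bytes and utf8_bytes are made up for illustration, and the bytes are written as octal escapes so the result does not depend on the current locale.

```shell
# Bytes of "é" (U+00E9) in each encoding, printed as hex.
# Octal escapes keep the script itself pure ASCII.
latin1_bytes() { printf '\351'     | od -An -tx1 | tr -d ' \n'; }
utf8_bytes()   { printf '\303\251' | od -An -tx1 | tr -d ' \n'; }

echo "ISO-8859-1: $(latin1_bytes)"   # prints: ISO-8859-1: e9
echo "UTF-8:      $(utf8_bytes)"     # prints: UTF-8:      c3a9
```

ISO-8859-1 spends one byte on this character and UTF-8 spends two, but a character such as a Japanese kana has no ISO-8859-1 byte at all, which is the practical argument for UTF-8.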

I am assuming from the previous text (found here
<http://serverfault.com/questions/605776/linux-locale-en-us-utf-8-vs-en-us>)
that en_US is an alias for en_US.iso88591. It seems I am correct in that
assumption:
'LC_ALL=en_US locale charmap' reveals
ISO-8859-1
I am thinking it is an alias! Am I correct?
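
A quick way to run the same check for all three candidates is sketched below. show_charmaps is a hypothetical helper name; the locale names are the ones from the question. The output depends on which locales are installed: on glibc, a name that is not installed normally makes locale print a warning on stderr and report the C locale's charmap instead, so it is worth comparing the names against 'locale -a' first.

```shell
# Print the character map that each given locale name resolves to.
show_charmaps() {
    for loc in "$@"; do
        # LC_ALL overrides LANG and every LC_* variable for this one command.
        printf '%-16s %s\n' "$loc" "$(LC_ALL=$loc locale charmap)"
    done
}

show_charmaps en_US en_US.iso88591 en_US.utf8
```

If the first two lines report the same charmap, that supports the "alias" reading on that particular system.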


:-)~MIKE~(-:
Ken Moffat
2014-09-01 03:25:39 UTC
Post by Michael Havens
> I am assuming from the previous text (found here
> <http://serverfault.com/questions/605776/linux-locale-en-us-utf-8-vs-en-us>)
> that en_US is an alias for en_US.iso88591. It seems I am correct in that
> assumption:
> 'LC_ALL=en_US locale charmap' reveals
> ISO-8859-1
> I am thinking it is an alias! Am I correct?
It used to be. For modern glibc, I have no idea. Why not just use
the extra six characters and specify en_US.UTF-8?

ĸen
--
Nanny Ogg usually went to bed early. After all, she was an old lady.
Sometimes she went to bed as early as 6 a.m.
Michael Havens
2014-09-01 05:03:47 UTC
Post by Ken Moffat
> It used to be. For modern glibc, I have no idea. Why not just use
> the extra six characters and specify en_US.UTF-8?
en_US.UTF-8? Really? I was thinking the right thing to do was to use
en_US.iso88591 because the en_US locale is its alias. So, you are saying to
use en_US.UTF-8?
Simon Geard
2014-09-01 08:29:06 UTC
Post by Michael Havens
> en_US.UTF-8? Really? I was thinking the right thing to do was to use
> en_US.iso88591 because the en_US locale is its alias. So, you are saying
> to use en_US.UTF-8?
Yes. UTF-8 is modern and standard, designed to cope with non-English
characters. ISO-8859-1 isn't... it's basically a very limited extension
of ASCII to add the most common accented characters and currency
symbols, and it has no practical advantages unless you're stuck with
crappy software that can't deal with anything better.

I'm ranting a little, but 8859-1 has been the bane of my life lately,
working on a software translation project... too much old code that
can't understand the need to deal with Japanese text...

Simon.
Bruce Dubbs
2014-09-01 16:45:06 UTC
Post by Simon Geard
> Yes. UTF-8 is modern and standard, designed to cope with non-English
> characters. ISO-8859-1 isn't... it's basically a very limited extension
> of ASCII to add the most common accented characters and currency
> symbols, and it has no practical advantages unless you're stuck with
> crappy software that can't deal with anything better
>
> I'm ranting a little, but 8859-1 has been the bane of my life lately,
> working on a software translation project... too much old code that
> can't understand the need to deal with Japanese text...
Simon has some legitimate comments, but I don't use any setting for
locale/LANG. In my case, the man pages don't work properly:

LANG=en_US.UTF-8 man man

gives me things like:

The manual page associated with each of these argu?<80><90>
========
LANG=en_US.UTF-8 ls

also gives me a sort order that is not case sensitive. I do not like that.
========
Note that graphical applications like mail clients and browsers often
have their own independent locale settings.

-- Bruce
Ken Moffat
2014-09-01 17:27:49 UTC
Post by Bruce Dubbs
> Simon has some legitimate comments, but I don't use any setting for
> locale/LANG. In my case, the man pages don't work properly:
>
> LANG=en_US.UTF-8 man man
>
> gives me things like:
>
> The manual page associated with each of these argu?<80><90>
My locale settings are
LANG=en_GB.UTF-8
LC_ALL=en_GB.UTF-8

At the moment I'm on an old 7.5 system to test something else, but
there 'man man' is fine for me. Pasting, and allowing mutt to
reformat it -

DESCRIPTION
    man is the system's manual pager. Each page argument given to man is
    normally the name of a program, utility or function. The manual page
    associated with each of these arguments is then found and displayed.

(also checked in a tty, again no problem). Your result almost looks
like a "unicode in legacy charset" result, but I cannot see any
reason why the middle of 'arguments' would generate unicode, nor
highlighting codes. Very odd.
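
One possible explanation, offered as an assumption rather than a diagnosis: when groff formats a page for a UTF-8 locale it typically breaks words with U+2010 HYPHEN rather than ASCII '-', and a pager or terminal that is not treating its input as UTF-8 would show the trailing bytes of that character as <80><90>. The sketch below only confirms the byte sequence; hyphen_bytes is a made-up helper name.

```shell
# UTF-8 encoding of U+2010 HYPHEN, written as octal escapes and dumped as hex.
hyphen_bytes() { printf '\342\200\220' | od -An -tx1 | tr -d ' \n'; }

echo "U+2010 in UTF-8: $(hyphen_bytes)"   # prints: U+2010 in UTF-8: e28090
```

If that is what happened, "argu" followed by a line-break hyphen would end in one unprintable lead byte plus 0x80 and 0x90, which matches the reported tail.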
Post by Bruce Dubbs
> LANG=en_US.UTF-8 ls
>
> also gives me a sort order that is not case sensitive. I do not like that.
A case-sensitive sort order is something that really annoys me,
so I am pleased that en_GB.UTF-8 is case insensitive. I guess that
is another example of "you can't please everyone".
Post by Bruce Dubbs
> Note that graphical applications like mail clients and browsers often have
> their own independent locale settings.
I was not aware of that.

ĸen
Emanuele Rusconi
2014-09-01 19:20:26 UTC
Post by Ken Moffat
> Post by Bruce Dubbs
>> also gives me a sort order that is not case sensitive. I do not like that.
> A case-sensitive sort order is something that really annoys me,
> so I am pleased that en_GB.UTF-8 is case insensitive. I guess that
> is another example of "you can't please everyone".
The sort order can be set to another locale with LC_COLLATE.
I use "POSIX" for that.
There are many LC_* variables that can be set independently.
See "man 1 locale", "man 5 locale", "man 7 locale".
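
The lookup order behind this is: LC_ALL if it is set, otherwise the category's own variable (for example LC_COLLATE), otherwise LANG. A minimal sketch of that rule as a shell function follows. effective_locale is a hypothetical helper, not a system command, and it only reads the variables rather than asking the C library.

```shell
# Report which locale name applies to one category, e.g. LC_COLLATE.
# Precedence: LC_ALL, then the category's own variable, then LANG.
effective_locale() {
    category=$1
    eval "specific=\${$category:-}"   # value of the variable named by $category
    if [ -n "${LC_ALL:-}" ]; then
        echo "$LC_ALL"
    elif [ -n "$specific" ]; then
        echo "$specific"
    elif [ -n "${LANG:-}" ]; then
        echo "$LANG"
    else
        echo "POSIX"                  # nothing set: the default C/POSIX locale
    fi
}

# Example: UTF-8 everywhere except collation. Run in a subshell so the
# caller's own settings are left alone.
(
    unset LC_ALL LC_CTYPE
    LANG=en_US.UTF-8
    LC_COLLATE=POSIX
    echo "LC_COLLATE -> $(effective_locale LC_COLLATE)"
    echo "LC_CTYPE   -> $(effective_locale LC_CTYPE)"
)
```

This is why setting LC_ALL, as in an earlier post, hides any per-category override: it wins over everything else.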

-- Emanuele Rusconi
Bruce Dubbs
2014-09-01 19:44:23 UTC
Post by Ken Moffat
> A case-sensitive sort order is something that really annoys me,
> so I am pleased that en_GB.UTF-8 is case insensitive. I guess that
> is another example of "you can't please everyone".
Sure you can. That's why the L* variables exist.
Post by Emanuele Rusconi
> The sort order can be set to another locale with LC_COLLATE.
> I use "POSIX" for that.
> There are many LC_* variables that can be set independently.
> See "man 1 locale", "man 5 locale", "man 7 locale".
Certainly, but I have no need for non-ASCII characters in a terminal so
I prefer to leave LANG and LC_* unset.

-- Bruce
Emanuele Rusconi
2014-09-01 19:54:32 UTC
Post by Bruce Dubbs
> Certainly, but I have no need for non-ascii characters in a terminal so I
> prefer to leave LANG and LC_* unset.
Clearly you don't listen to bands like Änglagård :)

-- Emanuele Rusconi
Bruce Dubbs
2014-09-01 19:59:48 UTC
Post by Emanuele Rusconi
> Clearly you don't listen to bands like Änglagård :)
I prefer Beethoven.

-- Bruce
Emanuele Rusconi
2014-09-01 20:07:00 UTC
Post by Bruce Dubbs
> Post by Emanuele Rusconi
>> Clearly you don't listen to bands like Änglagård :)
> I prefer Beethoven.
Ah, you're more on the classical side. I should have chosen Arvo Pärt
then.

-- Emanuele Rusconi
Simon Geard
2014-09-02 08:18:35 UTC
Post by Bruce Dubbs
> Certainly, but I have no need for non-ascii characters in a terminal so
> I prefer to leave LANG and LC_* unset.
Whereas, while I can personally only speak English with any fluency, I
travel regularly for pleasure, and work with people from around the
world as part of my job. Being able to correctly deal with all manner of
accented characters (and more recently, non-Latin alphabets) is
essential.

Simon.

Walter Webb
2014-09-01 19:20:00 UTC
Post by Bruce Dubbs
> LANG=en_US.UTF-8 ls
> also gives me a sort order that is not case sensitive. I do not like that.
export LC_COLLATE=POSIX
in /etc/profile works for me
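
The effect of that setting can be seen with a small sketch. sort_c is a made-up helper name, and it puts LC_ALL=C on the sort command (rather than relying on LC_COLLATE=POSIX) only so that the result is the same whatever the caller's environment contains; C and POSIX name the same locale.

```shell
# Sort the arguments in C/POSIX collation: plain byte order,
# so every uppercase ASCII letter sorts before every lowercase one.
sort_c() { printf '%s\n' "$@" | LC_ALL=C sort; }

sort_c banana Apple apple Banana
# prints:
# Apple
# Banana
# apple
# banana
```

Running plain 'sort' on the same input under a UTF-8 locale such as en_US.UTF-8 typically interleaves the upper- and lower-case names instead, which is the behaviour discussed above.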
akhiezer
2014-09-01 19:38:25 UTC
Date: Mon, 01 Sep 2014 15:20:00 -0400
From: Walter Webb <ngogn at earthlink.net>
To: LFS Support List <lfs-support at lists.linuxfromscratch.org>
Subject: Re: [lfs-support] differences between en_US, en_US.iso88591,
and en_US.utf8
> Post by Bruce Dubbs
>> LANG=en_US.UTF-8 ls
>> also gives me a sort order that is not case sensitive. I do not like that.
> export LC_COLLATE=POSIX
> in /etc/profile works for me
- yep, likewise in (from slackware) /etc/profile.d/lang.sh:
----
# One side effect of the newer locales is that the sort order
# is no longer according to ASCII values, so the sort order will
# change in many places. Since this isn't usually expected and
# can break scripts, we'll stick with traditional ASCII sorting.
# If you'd prefer the sort algorithm that goes with your $LANG
# setting, comment this out.
export LC_COLLATE=C

# End of /etc/profile.d/lang.sh
----
(and similarly for /etc/profile.d/lang.csh).



akh
Ken Moffat
2014-09-01 16:28:35 UTC
Post by Michael Havens
> Post by Ken Moffat
>> It used to be. For modern glibc, I have no idea. Why not just use
>> the extra six characters and specify en_US.UTF-8?
> en_US.UTF-8? Really? I was thinking the right thing to do was to use
> en_US.iso88591 because the en_US locale is its alias. So, you are saying to
> use en_US.UTF-8?
To add to what Simon said -

Yes, ναι, да, kyllä, já, jā, có [ translations taken from Google
Translate, limited to those characters which I expect to be able
to read in a tty ¹ ]. In ISO-8859-1 you would not be able to read
the Cyrillic or Greek, and I dare say that the macron on the a
probably doesn't render either. Unfortunately, translations of
'yes' do not show up some of the Eastern European Latin characters
which are fairly commonly encountered, such as c with caron č, l
with stroke ł, o with double acute ő.

For example, there was a post on one of the lists last week from
somebody in Vietnam - in his sig he used the Vietnamese version of
his name with diacritical marks not commonly used in European
languages (I think there was an 'i' with a dot below it: no, I
cannot read that in a tty unless I restrict my console font to
Vietnamese, but I can read it in a graphical term with some fonts
installed). And we often get Eastern Europeans on the lists - it
is nice to be able to read people's names in their postings.

I saved this before sending it, and on the version of glibc in
LFS-7.4 [ my mail is on my server ] I can confirm that en_US was
indeed latin-1, and unable to make sense of some of my examples,
so I guess it always will be.

¹ Using my own LatGrkCyr fonts, of course ;-)

ĸen