The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and never quite know what it should be?

Did you ever get an email from your friends in Bulgaria with the subject line "???? ?????? ??? ????"?

I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they "couldn't do anything about it." Like many programmers, he just wished it would all blow over somehow.

But it won't. When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.

So I have an announcement to make: if you are a programmer working in 2003 and you don't know the basics of characters, character sets, encodings, and Unicode, and I catch you, I'm going to punish you by making you peel onions for 6 months in a submarine. I swear I will.

And one more thing:

IT'S NOT THAT HARD.

In this article I'll fill you in on exactly what every working programmer should know. All that stuff about "plain text = ascii = characters are 8 bits" is not only wrong, it's hopelessly wrong, and if you're still programming that way, you're not much better than a medical doctor who doesn't believe in germs. Please do not write another line of code until you finish reading this article.

Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified. I'm really just trying to set a minimum bar here so that everyone can understand what's going on and can write code that has a hope of working with text in any language other than the subset of English that doesn't include words with accents. And I should warn you that character handling is only a tiny portion of what it takes to create software that works internationally, but I can only write about one thing at a time so today it's character sets.

A Historical Perspective

The easiest way to understand this stuff is to go chronologically.

You probably think I'm going to talk about very old character sets like EBCDIC here. Well, I won't. EBCDIC is not relevant to your life. We don't have to go that far back in time.

[ASCII table]

Back in the semi-olden days, when Unix was being invented and K&R were writing The C Programming Language, everything was very simple. EBCDIC was on its way out. The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter "A" was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own nefarious purposes: the dim bulbs at WordStar actually turned on the high bit to indicate the last letter in a word, condemning WordStar to English text only. Codes below 32 were called unprintable and were used for cussing. Just kidding. They were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in.
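If you want to see those values with your own eyes, here's a quick C sketch (assuming an ASCII machine, as everything above does):

/* The ASCII values mentioned above, as a C compiler sees them on an
   ASCII machine: space is 32, 'A' is 65, bell is 7, form feed is 12. */
#include <stdio.h>

int main(void)
{
    printf("space = %d, 'A' = %d, bell = %d, form feed = %d\n",
           ' ', 'A', '\a', '\f');
    /* prints: space = 32, 'A' = 65, bell = 7, form feed = 12 */
    return 0;
}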

And all was well, assuming you were an English speaker.

Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255. The IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and a bunch of line drawing characters… horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc., and you could use these line drawing characters to make spiffy boxes and lines on the screen, which you can still see running on the 8088 computer at your dry cleaners'. In fact, as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans would send their résumés to Israel they would arrive as rגsumגs. In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn't even reliably interchange Russian documents.

Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages. So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few "multilingual" code pages that could do Esperanto and Galician on the same computer! Wow! But getting, say, Hebrew and Greek on the same computer was a complete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.

Meanwhile, in Asia, even crazier things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits. This was usually solved by the messy system called DBCS, the "double byte character set," in which some letters were stored in one byte and others took two. It was easy to move forward in a string, but dang near impossible to move backwards. Programmers were encouraged not to use s++ and s-- to move backwards and forwards, but instead to call functions such as Windows' AnsiNext and AnsiPrev which knew how to deal with the whole mess.
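Here's a rough sketch of why forward is easy and backward is not. The lead-byte ranges below (0x81-0x9F and 0xE0-0xFC) are the Shift-JIS ones, used purely as an illustration; real code should just call AnsiNext and AnsiPrev and let Windows worry about it:

/* Forward is easy: look at the current byte and skip one or two. */
int is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

const char *dbcs_next(const char *s)
{
    return s + (is_lead_byte((unsigned char)*s) ? 2 : 1);
}

/* Backward is not: the byte just before you could be a one-byte character
   or the trail byte of a two-byte character, and the only way to know for
   sure is to rescan from the beginning of the string. Hence AnsiPrev. */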

But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.

Unicode

Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, as well. Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory:

A -> 0100 0001

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story.

In Unicode, the letter A is a platonic ideal. It's just floating in heaven:

A

This platonic A is different than B, and different from a, but the same as A and A and A. The idea that A in a Times New Roman font is the same character as the A in a Helvetica font, but different from "a" in lower case, does not seem very controversial, but in some languages just figuring out what a letter is can cause controversy. Is the German letter ß a real letter or just a fancy way of writing ss? If a letter's shape changes at the end of the word, is that a different letter? Hebrew says yes, Arabic says no. Anyway, the smart people at the Unicode consortium have been figuring this out for the last decade or so, accompanied by a great deal of highly political debate, and you don't have to worry about it. They've figured it all out already.

Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041. You can find them all using the charmap utility on Windows 2000/XP or by visiting the Unicode web site.

There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536, so not every Unicode letter can really be squeezed into two bytes, but that was a myth anyway.

OK, so say we have a string:

Hello

which, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F.

Just a bunch of code points. Numbers, really. We haven't yet said anything about how to store this in memory or represent it in an email message.
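If you want to see those code points printed out, here's a quick C sketch. It assumes a platform where wchar_t literals hold Unicode code point values, which is true on the usual platforms but not actually promised by the C standard:

/* Print the code points of "Hello". */
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    const wchar_t *s = L"Hello";
    for (size_t i = 0; i < wcslen(s); i++)
        printf("U+%04X ", (unsigned int)s[i]);
    printf("\n");   /* prints: U+0048 U+0065 U+006C U+006C U+006F */
    return 0;
}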

Encodings

That's where encodings come in.

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let's just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn't it also be:

48 00 65 00 6C 00 6C 00 6F 00 ?

Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. So the people were forced to come up with the bizarre convention of storing an FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like an FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the start.
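Here's a minimal sketch of what a reader does with those first two bytes (the helper function is made up, just to illustrate the convention):

/* Look at the first two bytes of a chunk of UCS-2 text and decide
   whether the rest needs byte-swapping. */
#include <stddef.h>

enum bom { BOM_BIG_ENDIAN, BOM_LITTLE_ENDIAN, BOM_MISSING };

enum bom detect_byte_order(const unsigned char *buf, size_t len)
{
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_BIG_ENDIAN;      /* FE FF: high byte first, read as-is */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_LITTLE_ENDIAN;   /* FF FE: swap every pair of bytes */
    return BOM_MISSING;             /* no mark: you're left guessing */
}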

For a while it seemed like that might be good enough, but programmers were complaining. "Look at all those zeros!" they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn't have minded guzzling twice the number of bytes. But those Californian wimps couldn't bear the thought of doubling the amount of storage it took for strings, and anyway, there were already all these doggone documents out there using various ANSI and DBCS character sets and who's going to convert them all? Moi? For this reason alone most people decided to ignore Unicode for several years and in the meantime things got worse.

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

How UTF-8 works

This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you'll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings.)
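Here's a minimal sketch of the packing scheme, covering code points up to U+10FFFF (the original design allowed sequences of up to 6 bytes, but 4 is as far as you need for any code point defined today):

/* Pack one Unicode code point into 1-4 UTF-8 bytes.
   Returns the number of bytes written to out. */
int utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                      /* 0xxxxxxx: ASCII stays as-is */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {              /* 110xxxxx 10xxxxxx */
        out[0] = 0xC0 | (unsigned char)(cp >> 6);
        out[1] = 0x80 | (unsigned char)(cp & 0x3F);
        return 2;
    } else if (cp < 0x10000) {            /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = 0xE0 | (unsigned char)(cp >> 12);
        out[1] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
        out[2] = 0x80 | (unsigned char)(cp & 0x3F);
        return 3;
    } else {                              /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = 0xF0 | (unsigned char)(cp >> 18);
        out[1] = 0x80 | (unsigned char)((cp >> 12) & 0x3F);
        out[2] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
        out[3] = 0x80 | (unsigned char)(cp & 0x3F);
        return 4;
    }
}

So utf8_encode(0x0048, buf) writes the single byte 48, exactly as in ASCII, while utf8_encode(0x0639, buf), the Arabic letter Ain, writes the two bytes D8 B9.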

So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-bytes methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it's high-endian UCS-2 or low-endian UCS-2. And there's the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.

There are actually a bunch of other ways of encoding Unicode. There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you, it can still squeeze through unscathed. There's UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn't be so bold as to waste that much memory.

And in fact now that you're thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those Unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek encoding, or the Hebrew ANSI encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box. Which did you get? -> �

There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF-7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
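Here's a sketch of that "some letters might not show up" behavior, using Latin-1 as the target since it conveniently covers exactly the code points U+0000 through U+00FF; anything outside that range has no equivalent and degrades to a question mark:

/* Encode a run of Unicode code points into ISO-8859-1 (Latin-1),
   replacing everything it cannot represent with '?'. */
#include <stddef.h>

void encode_latin1(const unsigned long *codepoints, size_t n, unsigned char *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (codepoints[i] <= 0xFF) ? (unsigned char)codepoints[i] : '?';
}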

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII.

There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

Almost every stupid "my website looks like gibberish" or "she can't read my emails when I use accents" problem comes down to one naive programmer who didn't understand the simple fact that if you don't tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.

How do we preserve this information about what encoding a string uses? Well, there are standard ways to do this. For an email message, you are expected to have a string in the header of the form

Content-Type: text/plain; charset="UTF-8"

For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself — not in the HTML itself, but as one of the response headers that are sent before the HTML page.

This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages and all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn't really know what encoding each file was written in, so it couldn't send the Content-Type header.

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy… how can you read the HTML file until you know what encoding it's in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it's going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.

What do web browsers do if they don't find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency with which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8-bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It's truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn't exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it's Korean and displays it thusly, proving, I think, the point that Postel's Law about being "conservative in what you emit and liberal in what you accept" is quite frankly not a good engineering principle. Anyway, what does the poor reader of this website, which was written in Bulgarian but appears to be Korean (and not even cohesive Korean), do? He uses the View | Encoding menu and tries a bunch of different encodings (there are at least a dozen for Eastern European languages) until the picture comes in clearer. If he knew to do that, which most people don't.
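For the flavor of what "guessing" involves, here's a sketch of one very simple heuristic. This is not Internet Explorer's frequency-histogram trick, just the most basic sanity check you can run first: do the bytes above 127 even parse as UTF-8? If they do, the text is very probably UTF-8; if not, you're down to guessing a legacy code page:

/* Return 1 if every byte sequence above 127 forms a valid-looking
   UTF-8 multi-byte sequence, 0 otherwise. */
#include <stddef.h>

int looks_like_utf8(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; ) {
        if (s[i] < 0x80) { i++; continue; }           /* plain ASCII */
        size_t extra;
        if ((s[i] & 0xE0) == 0xC0)      extra = 1;    /* 110xxxxx */
        else if ((s[i] & 0xF0) == 0xE0) extra = 2;    /* 1110xxxx */
        else if ((s[i] & 0xF8) == 0xF0) extra = 3;    /* 11110xxx */
        else return 0;                                /* invalid lead byte */
        if (i + extra >= len) return 0;               /* truncated sequence */
        for (size_t j = 1; j <= extra; j++)
            if ((s[i + j] & 0xC0) != 0x80) return 0;  /* not a 10xxxxxx byte */
        i += extra + 1;
    }
    return 1;
}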

For the latest version of CityDesk, the web site management software published by my company, we decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type. In C++ code we just declare strings as wchar_t ("wide char") instead of char and use the wcs functions instead of the str functions (for example wcscat and wcslen instead of strcat and strlen). To create a literal UCS-2 string in C code you just put an L before it like this: L"Hello".
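A quick, made-up illustration of that style (not CityDesk code): wchar_t strings, an L prefix on the literals, and wcslen/wcscat in place of strlen/strcat:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t greeting[32] = L"Hello, ";          /* wide-character buffer */
    wcscat(greeting, L"world");                  /* wcscat instead of strcat */
    wprintf(L"%ls (%d characters)\n", greeting, (int)wcslen(greeting));
    return 0;
}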

When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years. That's the way all 29 language versions of Joel on Software are encoded and I have not yet heard from a single person who has had any problem viewing them.
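That conversion step looks roughly like this with the Win32 API (a sketch, not the actual CityDesk code; error handling and buffer sizing are glossed over):

/* Convert an in-memory UCS-2/UTF-16 string to UTF-8 for publishing.
   Returns the number of bytes written, including the terminating 0,
   or 0 on failure (e.g. the output buffer was too small). */
#include <windows.h>

int to_utf8(const wchar_t *wide, char *out, int out_size)
{
    return WideCharToMultiByte(CP_UTF8, 0, wide, -1, out, out_size, NULL, NULL);
}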

This article is getting rather long, and I can't possibly cover everything there is to know about character encodings and Unicode, but I hope that if you've read this far, you know enough to go back to programming, using antibiotics instead of leeches and spells, a task to which I will leave you now.


Source: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
