
Lesson 3: The Reader and Writer classes
Objective: Examine the Reader and Writer classes for converting character-based data.

Java Readers and Writers

The java.io.Reader and java.io.Writer classes are abstract superclasses for classes that read and write character-based data.
The subclasses are notable for handling the conversion between different character sets.
Input and output streams are fundamentally byte-based, whereas readers and writers are character-based. In Java, a char is a 16-bit (two-byte) UTF-16 code unit, but the character sets you may have to read or write use characters of varying widths. ASCII and ISO Latin-1 use one byte per character. UTF-16 uses two bytes for most characters and four bytes (a surrogate pair) for characters outside the Basic Multilingual Plane. UTF-8 uses between one and four bytes per character. Readers and writers handle all of these character sets, and many more, transparently.
The encoding and decoding classes themselves are hidden in the sun packages. They are used internally by the InputStreamReader and OutputStreamWriter classes to convert the bytes used by streams into the chars used by readers and writers, and vice versa.
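The byte-to-char conversion described above can be sketched as a small round trip through in-memory streams. This is only an illustration: the class name CharsetRoundTrip and the helper methods encode and decode are made up for this example, and the sample text is arbitrary.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

// Hypothetical demo class, not part of the lesson's own code.
public class CharsetRoundTrip {

    // Encode a string to bytes by writing chars through an OutputStreamWriter.
    public static byte[] encode(String text, String charsetName) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charsetName)) {
            writer.write(text);
        } // closing the writer flushes any buffered chars into the stream
        return bytes.toByteArray();
    }

    // Decode bytes back to a string by reading chars through an InputStreamReader.
    public static String decode(byte[] data, String charsetName) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charsetName)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "caf\u00e9"; // the last character is U+00E9
        System.out.println(encode(text, "ISO-8859-1").length); // 4: one byte per character
        System.out.println(encode(text, "UTF-8").length);      // 5: U+00E9 takes two bytes
        System.out.println(decode(encode(text, "UTF-8"), "UTF-8").equals(text)); // true
    }
}
```

The same four chars produce a different number of bytes depending on the encoding, which is why the reader must be told (or must assume) the encoding the writer used.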
You can also convert byte arrays in a particular character set into Unicode strings using two String constructors.

Java String Constructors

public String(byte[] bytes, int offset, int length, String encoding) throws UnsupportedEncodingException

public String(byte[] bytes, String encoding) throws UnsupportedEncodingException

If you have a byte array t that you know is encoded in the ISO 8859-9 character set (essentially ASCII plus Turkish), you would convert it into a Unicode string like this:
String s = new String(t, "ISO-8859-9");
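As a minimal sketch of both constructors, the example below decodes a short byte array, assuming the JVM ships the ISO-8859-9 charset (standard JDK builds normally do). The class name TurkishDecode, the helper decodeTurkish, and the sample bytes are invented for illustration.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical demo class, not part of the lesson's own code.
public class TurkishDecode {

    // Decode ISO 8859-9 bytes with the String(byte[], String) constructor.
    public static String decodeTurkish(byte[] t) throws UnsupportedEncodingException {
        return new String(t, "ISO-8859-9");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // 0xDD is the capital dotted I (U+0130) in ISO 8859-9; the rest is plain ASCII.
        byte[] t = {(byte) 0xDD, 's', 't', 'a', 'n', 'b', 'u', 'l'};

        String s = decodeTurkish(t);
        System.out.println(s.length());        // 8: one char per byte in this encoding
        System.out.println((int) s.charAt(0)); // 304, that is U+0130

        // The offset/length variant decodes only part of the array.
        System.out.println(new String(t, 1, 3, "ISO-8859-9")); // sta
    }
}
```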


The exact list of encodings available varies a little from platform to platform. Here are most of the important Character Set Encodings. This same set of encodings is used by readers that convert all bytes they read into Unicode chars.

Character Set Encodings

| Name | Character Set | Description |
| --- | --- | --- |
| 8859_1 | ISO 8859-1 (Latin-1) | Western European languages |
| 8859_2 | ISO 8859-2 (Latin-2) | In combination with the Latin-1 character set, this set provides for most Central European languages |
| 8859_3 | ISO 8859-3 (Latin-3) | Esperanto |
| 8859_4 | ISO 8859-4 (Latin-4) | Baltic |
| 8859_5 | ISO 8859-5 | Latin/Cyrillic |
| 8859_6 | ISO 8859-6 | Latin/Arabic |
| 8859_7 | ISO 8859-7 | Latin/Greek |
| 8859_8 | ISO 8859-8 | Latin/Hebrew |
| 8859_9 | ISO 8859-9 | Latin/Turkish |
| Big5 | The Big 5 encoding for Chinese | |
| CNS11643 | Chinese | |
| Cp037 | EBCDIC American English | |
| Cp273 | IBM273 | |
| Cp277 | EBCDIC Danish/Norwegian | |
| Cp278 | EBCDIC Finnish/Swedish | |
| Cp280 | EBCDIC Italian | |
| Cp284 | EBCDIC Spanish | |
| Cp285 | EBCDIC UK English | |
| Cp297 | EBCDIC French | |
| Cp420 | EBCDIC Arabic 1 | |
| Cp424 | EBCDIC Hebrew | |
| Cp437 | Original DOS IBM PC character set | Mostly ASCII with some extra characters for drawing lines and boxes |
| Cp500 | EBCDIC Flemish/Romulsch | |
| Cp737 | DOS Greek | |
| Cp775 | DOS Baltic | |
| Cp850 | DOS Latin-1 | |
| Cp852 | DOS Latin-2 | |
| Cp855 | DOS Cyrillic | |
| Cp856 | IBM856 | |
| Cp857 | DOS Turkish | |
| Cp860 | DOS Portuguese | |
| Cp861 | DOS Icelandic | |
| Cp862 | DOS Hebrew | |
| Cp863 | DOS Canadian French | |
| Cp864 | DOS Arabic | |
| Cp865 | IBM865 | |
| Cp866 | IBM866 | |
| Cp868 | EBCDIC Arabic | |
| Cp869 | DOS modern Greek | |
| Cp870 | EBCDIC Serbian | |
| Cp871 | EBCDIC Icelandic | |
| Cp874 | Windows Thai | |
| Cp875 | IBM875 | |
| Cp918 | EBCDIC Arabic 2 | |
| Cp921 | IBM921 | |
| Cp922 | IBM922 | |
| Cp1006 | IBM1006 | |
| Cp1025 | IBM1025 | |
| Cp1026 | IBM1026 | |
| Cp1046 | IBM1046 | |
| Cp1097 | IBM1097 | |
| Cp1098 | IBM1098 | |
| Cp1112 | IBM1112 | |
| Cp1122 | IBM1122 | |
| Cp1123 | IBM1123 | |
| Cp1124 | IBM1124 | |
| Cp1250 | Windows Eastern European | Essentially ISO Latin-2 |
| Cp1251 | Windows Cyrillic | |
| Cp1252 | Windows Western European | Essentially ISO Latin-1 |
| Cp1253 | Windows Greek | |
| Cp1254 | Windows Turkish | |
| Cp1255 | Windows Hebrew | |
| Cp1256 | Windows Arabic | |
| Cp1257 | Windows Baltic | |
| Cp1258 | Windows Vietnamese | |
| EUCJIS | Japanese EUC | |
| GB2312 | Chinese | |
| JIS | Japanese Hiragana | |
| JIS0208 | Japanese | |
| KSC5601 | Korean | |
| MacArabic | The Macintosh Arabic character set | |
| MacCentralEurope | The Macintosh Central European character set | |
| MacCroatian | The Macintosh Croatian character set | |
| MacCyrillic | The Macintosh Cyrillic character set | |
| MacDingbat | Zapf Dingbats | |
| MacGreek | The Macintosh modern Greek character set | |
| MacHebrew | The Macintosh Hebrew character set | |
| MacIceland | The Macintosh Icelandic character set | |
| MacRoman | The standard Macintosh U.S. English character set | |
| MacRomania | The Macintosh Romanian character set | |
| MacSymbol | The Adobe Symbol font | Includes a complete Greek alphabet in place of the usual Roman letters |
| MacThai | The Macintosh Thai character set | |
| MacTurkish | The Macintosh Turkish character set | |
| MacUkraine | The Macintosh Ukrainian character set | |
| SJIS | Windows Japanese | |
| UTF8 | UCS Transformation Format | 8-bit form |
| Unicode | Normal Unicode | |
| UnicodeBig | Unicode | With big-endian byte order |
| UnicodeLittle | Unicode | With little-endian byte order |
| UnicodeBigUnmarked | Unicode | With big-endian byte order but without an FEFF marking the start of Unicode text |
| UnicodeLittleUnmarked | Unicode | With little-endian byte order but without an FFFE marking the start of Unicode text |
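Because the available encodings vary by platform, it can help to ask the running JVM what it supports rather than rely on a fixed list. The sketch below uses java.nio.charset.Charset, which this lesson does not otherwise cover. The class name ListCharsets and its helper methods are hypothetical, and the names probed in main are simply a few entries from the table.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the lesson's own code.
public class ListCharsets {

    // Returns true if this JVM can encode and decode the named character set.
    public static boolean isAvailable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false; // the name is null or syntactically illegal
        }
    }

    // Maps an alias such as "UTF8" to the canonical name the JVM reports.
    public static String canonicalName(String name) {
        return Charset.forName(name).name();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"8859_1", "UTF8", "Cp1252", "MacRoman"}) {
            if (isAvailable(name)) {
                System.out.println(name + " -> " + canonicalName(name));
            } else {
                System.out.println(name + " is not available on this JVM");
            }
        }
        System.out.println(Charset.availableCharsets().size() + " charsets installed");
    }
}
```

Many of the short names in the table are treated as aliases by current JVMs, so the canonical name printed may differ from the name you passed in.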



Java I/O
  1. FileReader: This class is used to read character files. Its read() methods are fairly low-level, allowing you to read single characters, the whole stream of characters, or a fixed number of characters. FileReaders are usually wrapped by higher-level objects such as BufferedReaders, which improve performance and provide more convenient ways to work with the data.
  2. BufferedReader: This class is used to make lower-level Reader classes like FileReader more efficient and easier to use. Compared to FileReaders, BufferedReaders read relatively large chunks of data from a file at once and keep this data in a buffer. When you ask for the next character or line of data, it is retrieved from the buffer, which minimizes the number of times that time-intensive, file-read operations are performed. In addition, BufferedReader provides more convenient methods, such as readLine(), that allow you to get the next line of characters from a file.
  3. FileWriter: This class is used to write to character files. Its write() methods allow you to write character(s) or strings to a file. FileWriters are usually wrapped by higher-level Writer objects, such as BufferedWriters or PrintWriters, which provide better performance and higher-level, more flexible methods to write data.
  4. BufferedWriter: This class is used to make lower-level classes like FileWriters more efficient and easier to use. Compared to FileWriters, BufferedWriters write relatively large chunks of data to a file at once, minimizing the number of times that slow, file-writing operations are performed. The BufferedWriter class also provides a newLine() method to create platform-specific line separators automatically.
  5. PrintWriter: This class was enhanced significantly in Java 5. Thanks to the constructors added then (such as building a PrintWriter from a File or from a String file name), you can often use a PrintWriter directly where you previously had to wrap a FileWriter and/or a BufferedWriter. Methods such as format(), printf(), and append() make PrintWriters very flexible and powerful.
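The wrapping pattern described in the list above might look like the following sketch, which writes a few lines to a temporary file and reads them back. The class name TextFileDemo, the helper methods, and the file name prefix are assumptions made for this example. FileReader and FileWriter here use the platform's default character set, which is adequate for the ASCII sample text.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of the lesson's own code.
public class TextFileDemo {

    // FileWriter does the low-level writing, BufferedWriter batches it,
    // and PrintWriter adds convenient println()/printf() methods.
    public static void writeLines(File file, List<String> lines) throws IOException {
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(file)))) {
            for (String line : lines) {
                out.println(line);
            }
        }
    }

    // BufferedReader wraps FileReader so that readLine() is available.
    public static List<String> readLines(File file) throws IOException {
        List<String> result = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = in.readLine()) != null) { // null signals end of file
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("reader-writer-demo", ".txt");
        file.deleteOnExit();
        writeLines(file, Arrays.asList("first line", "second line"));
        System.out.println(readLines(file)); // [first line, second line]
    }
}
```

Note that PrintWriter does not throw IOException from its print methods; call checkError() if you need to detect a failed write.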


Reader Writer Classes
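Every platform has a default character set that is used when no other is explicitly specified. On U.S. and Western European versions of Windows that is likely to be Cp1252, which is essentially ISO Latin-1. On the classic Mac OS it is likely to be MacRoman. Since Java 18, the default is UTF-8 on every platform unless it is overridden with the file.encoding system property.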
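To see which default applies on a given machine, and to avoid depending on it, something like the following sketch can be used. It relies on java.nio.charset.Charset and StandardCharsets (Java 7 and later), which are outside this lesson's scope. The class name DefaultCharsetDemo and the helper encodeExplicit are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the lesson's own code.
public class DefaultCharsetDemo {

    // Encoding with an explicit charset gives the same bytes on every platform.
    public static byte[] encodeExplicit(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The charset used when none is specified; varies by platform and Java version.
        System.out.println("Default charset: " + Charset.defaultCharset());

        String text = "caf\u00e9"; // the last character is U+00E9

        // getBytes() with no argument uses the default, so the length may differ by machine.
        System.out.println("Default encoding length: " + text.getBytes().length);

        // With UTF-8 named explicitly the result is always 5 bytes.
        System.out.println("UTF-8 length: " + encodeExplicit(text).length);
    }
}
```

Naming the encoding explicitly, whether here or when constructing an InputStreamReader or OutputStreamWriter, keeps files portable between machines with different defaults.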
