Which encoding opens CSV files correctly with Excel on both Mac and Windows?

procodes 2020. 7. 1. 21:53

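I have a web app that exports CSV files containing foreign characters, encoded as UTF-8 without a BOM. Both Windows and Mac users get garbage characters in Excel. I tried converting to UTF-8 with a BOM: Excel/Win is fine with it, Excel/Mac shows gibberish. I'm using Excel 2003/Win and Excel 2011/Mac. Here are all the encodings I tried: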

Encoding  BOM      Win                            Mac
--------  ---      ----------------------------   ------------
utf-8     --       scrambled                      scrambled
utf-8     BOM      WORKS                          scrambled
utf-16    --       file not recognized            file not recognized
utf-16    BOM      file not recognized            Chinese gibberish
utf-16LE  --       file not recognized            file not recognized
utf-16LE  BOM      characters OK,                 same as Win
                   row data all in first field

UTF-16LE with BOM comes closest, but the file is not recognized as CSV. The field separator is a comma, and switching to a semicolon doesn't change things.

Is there any encoding that works in both worlds?


Excel encodings

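I found the WINDOWS-1252 encoding to be the least frustrating when dealing with Excel. Since it is basically Microsoft's own proprietary character set, one can assume it will work on both the Mac and the Windows version of MS-Excel. Both versions at least include a corresponding "File origin" or "File encoding" selector which correctly reads the data.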

Depending on your system and the tools you use, this encoding may also be named CP1252, ANSI, Windows (ANSI), MS-ANSI or just Windows, among other variations.

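This encoding is a superset of ISO-8859-1 (aka LATIN1, among other names), so you can fall back to ISO-8859-1 if you can't use WINDOWS-1252 for some reason. Be advised that ISO-8859-1 is missing some characters from WINDOWS-1252, as shown here: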

| Char | ANSI | Unicode | ANSI Hex | Unicode Hex | HTML entity | Unicode Name                               | Unicode Range            |
| €    | 128  | 8364    | 0x80     | U+20AC      | &euro;      | euro sign                                  | Currency Symbols         |
| ‚    | 130  | 8218    | 0x82     | U+201A      | &sbquo;     | single low-9 quotation mark                | General Punctuation      |
| ƒ    | 131  | 402     | 0x83     | U+0192      | &fnof;      | Latin small letter f with hook             | Latin Extended-B         |
| „    | 132  | 8222    | 0x84     | U+201E      | &bdquo;     | double low-9 quotation mark                | General Punctuation      |
| …    | 133  | 8230    | 0x85     | U+2026      | &hellip;    | horizontal ellipsis                        | General Punctuation      |
| †    | 134  | 8224    | 0x86     | U+2020      | &dagger;    | dagger                                     | General Punctuation      |
| ‡    | 135  | 8225    | 0x87     | U+2021      | &Dagger;    | double dagger                              | General Punctuation      |
| ˆ    | 136  | 710     | 0x88     | U+02C6      | &circ;      | modifier letter circumflex accent          | Spacing Modifier Letters |
| ‰    | 137  | 8240    | 0x89     | U+2030      | &permil;    | per mille sign                             | General Punctuation      |
| Š    | 138  | 352     | 0x8A     | U+0160      | &Scaron;    | Latin capital letter S with caron          | Latin Extended-A         |
| ‹    | 139  | 8249    | 0x8B     | U+2039      | &lsaquo;    | single left-pointing angle quotation mark  | General Punctuation      |
| Œ    | 140  | 338     | 0x8C     | U+0152      | &OElig;     | Latin capital ligature OE                  | Latin Extended-A         |
| Ž    | 142  | 381     | 0x8E     | U+017D      |             | Latin capital letter Z with caron          | Latin Extended-A         |
| ‘    | 145  | 8216    | 0x91     | U+2018      | &lsquo;     | left single quotation mark                 | General Punctuation      |
| ’    | 146  | 8217    | 0x92     | U+2019      | &rsquo;     | right single quotation mark                | General Punctuation      |
| “    | 147  | 8220    | 0x93     | U+201C      | &ldquo;     | left double quotation mark                 | General Punctuation      |
| ”    | 148  | 8221    | 0x94     | U+201D      | &rdquo;     | right double quotation mark                | General Punctuation      |
| •    | 149  | 8226    | 0x95     | U+2022      | &bull;      | bullet                                     | General Punctuation      |
| –    | 150  | 8211    | 0x96     | U+2013      | &ndash;     | en dash                                    | General Punctuation      |
| —    | 151  | 8212    | 0x97     | U+2014      | &mdash;     | em dash                                    | General Punctuation      |
| ˜    | 152  | 732     | 0x98     | U+02DC      | &tilde;     | small tilde                                | Spacing Modifier Letters |
| ™    | 153  | 8482    | 0x99     | U+2122      | &trade;     | trade mark sign                            | Letterlike Symbols       |
| š    | 154  | 353     | 0x9A     | U+0161      | &scaron;    | Latin small letter s with caron            | Latin Extended-A         |
| ›    | 155  | 8250    | 0x9B     | U+203A      | &rsaquo;    | single right-pointing angle quotation mark | General Punctuation      |
| œ    | 156  | 339     | 0x9C     | U+0153      | &oelig;     | Latin small ligature oe                    | Latin Extended-A         |
| ž    | 158  | 382     | 0x9E     | U+017E      |             | Latin small letter z with caron            | Latin Extended-A         |
| Ÿ    | 159  | 376     | 0x9F     | U+0178      | &Yuml;      | Latin capital letter Y with diaeresis      | Latin Extended-A         |

Note that the euro sign is among the missing characters. This table can be found at Alan Wood.
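To decide which of these encodings a given export can use, one option is simply to try them in order. The following is a minimal Python sketch, not something from the answer: the function name `narrowest_encoding` and the final fallback to UTF-16 are assumptions made for illustration.

```python
def narrowest_encoding(text: str) -> str:
    """Return the first of ISO-8859-1, WINDOWS-1252 or UTF-16 that can hold text."""
    # Python's codec name for WINDOWS-1252 is "cp1252".
    for codec in ("iso-8859-1", "cp1252"):
        try:
            text.encode(codec)
            return codec
        except UnicodeEncodeError:
            continue
    # Neither single-byte encoding fits, so a Unicode encoding is needed.
    return "utf-16"


choice = narrowest_encoding("Preis: 10 €")
```

Text containing only accented Latin letters fits ISO-8859-1, while anything using a character from the table above (such as the euro sign) needs WINDOWS-1252.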

Conversion

Conversion is done differently in every tool and language. However, suppose you have a file query_result.csv which you know is UTF-8 encoded. Convert it to WINDOWS-1252 using iconv:

iconv -f UTF-8 -t WINDOWS-1252 query_result.csv > query_result-win.csv
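Where iconv is not at hand, the same conversion can be sketched in Python. This is only an illustration under a few assumptions: the function name `utf8_to_cp1252` is made up, and replacing characters that WINDOWS-1252 cannot represent with `?` is a choice made here, not something the answer prescribes.

```python
def utf8_to_cp1252(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as WINDOWS-1252 (Python codec name: cp1252)."""
    # utf-8-sig accepts input with or without a BOM and drops the BOM if present.
    text = data.decode("utf-8-sig")
    # errors="replace" turns characters outside WINDOWS-1252 into "?".
    return text.encode("cp1252", errors="replace")


sample = "naïve café, 10 €\r\n".encode("utf-8")
converted = utf8_to_cp1252(sample)
```

For a real file such as query_result.csv, read it in binary mode, pass the bytes through the function, and write the result back in binary mode so that line endings are left untouched.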

For UTF-16LE with BOM, if you use tab characters as your delimiters instead of commas, Excel will recognise the fields. The reason it works is that Excel actually ends up using its Unicode *.txt parser.

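Caveat: if the file is edited in Excel and saved, it will be saved as tab-delimited ASCII. The problem then is that when you re-open the file, Excel assumes it is real CSV (with commas), sees that it is not Unicode, parses it as comma-delimited, and hence makes a hash of it!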

Update: The above caveat doesn't appear to be happening for me today in Excel 2010 (Windows) at least, although there does appear to be a difference in saving behaviour if:

  • you edit and quit Excel (tries to save as 'Unicode *.txt')

compared to:

  • editing and closing just the file (works as expected).

The lowdown is: There is no solution. Excel 2011/Mac cannot correctly interpret a CSV file containing umlauts and diacritical marks no matter what encoding or hoop jumping you do. I'd be glad to hear someone tell me different!


You have only tried comma-separated and semicolon-separated CSV. If you had tried tab-separated CSV (also called TSV) you would have found the answer:

UTF-16LE with BOM (byte order mark), tab-separated
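As a rough illustration of producing such a file, here is a minimal Python sketch. The function name `to_excel_tsv` and the sample rows are invented for the example, and CRLF line endings are an assumption rather than something this answer specifies.

```python
import csv
import io


def to_excel_tsv(rows) -> bytes:
    """Serialize rows as tab-separated text, encoded as UTF-16LE with a BOM."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\r\n")
    writer.writerows(rows)
    # Prefixing U+FEFF makes the utf-16-le encoder emit the BOM bytes FF FE.
    return ("\ufeff" + buffer.getvalue()).encode("utf-16-le")


payload = to_excel_tsv([["name", "city"], ["Jürgen", "Köln"]])
```

The whole text, tabs and line breaks included, is encoded in one step, so every separator ends up as a two-byte UTF-16 code unit like the rest of the data.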


But: In a comment you mention that TSV is not an option for you (I haven't been able to find this requirement in your question though). That's a pity. It often means that you allow manual editing of TSV files, which probably is not a good idea. Visual checking of TSV files is not a problem. Furthermore editors can be set to display a special character to mark tabs.

And yes, I tried this out on Windows and Mac.


Here's the clincher on importing UTF-8-encoded CSV into Excel 2011 for Mac. Microsoft says: "Excel for Mac does not currently support UTF-8." (Source: Excel for Mac 2011 and UTF-8)

Yay, way to go MS!


The best workaround for reading CSV files with UTF-8 on Mac is to convert them into XLSX format. I have found a script made by Konrad Foerstner, which I have improved a little bit by adding support for different delimiter characters.

Download the script from Github https://github.com/brablc/clit/blob/master/csv2xlsx.py. In order to run it you will need to install a python module openpyxl for Excel file manipulation: sudo easy_install openpyxl.


In my case it seems that Excel 2011 for Mac OS does not use Encoding.GetEncoding("10000"), as I had thought (and wasted two days on), but the same ISO encoding as on Windows. The best proof of this is to create a file with special characters in Excel 2011 for Mac, save it as CSV and then open it in a Mac text editor: the characters are scrambled.

For me this approach worked, meaning that a CSV export opened in Excel 2011 on Mac OS shows the special Western European characters correctly:

// exportText is the CSV content to export (defined elsewhere).
Encoding isoMacOS = Encoding.GetEncoding("iso-8859-1");
Encoding defaultEncoding = Encoding.Default;

// Convert the string into a byte array.
byte[] defaultEncodingBytes = defaultEncoding.GetBytes(exportText);

// Perform the conversion from one encoding to the other.
byte[] ansiBytes = Encoding.Convert(defaultEncoding, isoMacOS, defaultEncodingBytes);

// Decode the ISO-8859-1 bytes back into a string.
string decodedString = isoMacOS.GetString(ansiBytes);

UTF-8 with no BOM currently works for me in Excel Mac 2011 14.3.2.

UTF-8 + BOM kind of works, but BOM rendered as gibberish.

UTF-16 works if you Import the file and complete the wizard, but not if you just double-click it.


The following worked for me on Excel for Mac 2011 and Windows Excel 2002:

  1. Using iconv on Mac, convert the file to UTF-16 Little-Endian + name it *.txt (the .txt extension forces Excel to run the Text Import Wizard):

    iconv -f UTF-8 -t UTF-16LE filename.csv >filename_UTF-16LE.csv.txt

  2. Open the file in Excel and in the Text Import Wizard choose:

    • Step 1: File origin: ignore it, it doesn't matter what you choose
    • Step 2: select proper values for Delimiters and Text qualifier
    • Step 3: if necessary, select column formats

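PS The UTF-16LE file created by iconv has the BOM bytes FF FE at the beginning (presumably carried over from the BOM in the UTF-8 source file described below, since iconv does not normally add one when the target is UTF-16LE).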

PPS My original csv file was created on a Windows 7 computer, in UTF-8 format (with the BOM bytes EF BB BF in the beginning) and used CRLF line breaks. Comma was used as field delimiter and single quote as text qualifier. It contained ASCII letters plus different latin letters with tildes, umlaut etc, plus some cyrillic. All displayed properly in both Excel for Win and Mac.

PPPS Exact software versions:
* Mac OS X 10.6.8
* Excel for Mac 2011 v.14.1.3
* Windows Server 2003 SP2
* Windows Excel 2002 v.10.2701.2625
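Since so much here depends on which BOM bytes a file starts with (EF BB BF for UTF-8, FF FE for UTF-16LE), it can help to check before and after converting. Below is a minimal Python sketch for that; the name `sniff_bom` is made up for illustration, and it only covers the UTF-8 and UTF-16 marks discussed in this thread.

```python
import codecs

# Checked in order. A UTF-32LE BOM also starts with FF FE; that case is not handled here.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),       # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),   # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),   # FE FF
)


def sniff_bom(head: bytes):
    """Return the codec implied by a leading BOM, or None if there is none."""
    for bom, codec in _BOMS:
        if head.startswith(bom):
            return codec
    return None


detected = sniff_bom(b"\xff\xfen\x00a\x00")
```

Passing the first few bytes of a file (read in binary mode) is enough; a result of None means the file has no BOM and the encoding has to be known some other way.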


On my Mac OS, Text Wrangler identified a CSV file created with Excel as having "Western" encoding.

After some googling I have made this small script (I am not sure about Windows availability, maybe with Cygwin?):

$ cat /usr/local/bin/utf8.sh

#!/bin/bash

INPUTFILE="$1"

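# Convert from Mac OS Roman to UTF-8 (-c drops unconvertible characters) and turn CR into LF.
iconv -f macroman -c -t UTF-8 "$INPUTFILE" | tr '\r' '\n' > "/tmp/file.$$.csv"

# Move the original to ms_trash, then put the converted file in its place.
mv "$INPUTFILE" ms_trash
mv "/tmp/file.$$.csv" "$INPUTFILE"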
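For systems without bash and iconv (the Windows question above), roughly the same conversion can be sketched in Python. This is an illustrative equivalent rather than a tested replacement for the script; the function name `macroman_to_utf8` is made up.

```python
def macroman_to_utf8(data: bytes) -> bytes:
    """Decode Mac OS Roman bytes, turn CR into LF, and re-encode as UTF-8."""
    # errors="ignore" mirrors iconv's -c flag: drop whatever cannot be converted.
    text = data.decode("mac_roman", errors="ignore")
    # Same effect as tr '\r' '\n': every carriage return becomes a line feed.
    return text.replace("\r", "\n").encode("utf-8")


fixed = macroman_to_utf8("café\rthé".encode("mac_roman"))
```

As with the shell version, it is safer to write the result to a new file and keep the original until the output has been checked.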

In my case this worked (Mac, Excel 2011, both Cyrillic and Latin characters with Czech diacritics):

  • Charset UTF-16LE (simply UTF-16 was not enough)
  • BOM "\xFF\xFE"
  • \t (tab) as separator
  • Don't forget to encode also separator and CRLFs :-)
  • Use iconv instead of mb_convert_encoding
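The point about encoding the separator and the CRLFs as well can be shown with a short sketch. Python is used here purely for illustration (the answer itself appears to be about PHP), and the sample field values are invented.

```python
fields = ["Praha", "Москва"]

# Correct: build the whole line as text first, then encode it once.
line = "\t".join(fields) + "\r\n"
good = b"\xff\xfe" + line.encode("utf-16-le")

# Wrong: encode each field separately, then glue the pieces with single-byte ASCII.
bad = b"\xff\xfe" + b"\t".join(f.encode("utf-16-le") for f in fields) + b"\r\n"

# The tab, CR and LF each take one byte in bad but two bytes in good.
difference = len(good) - len(bad)
```

In the wrong variant everything after the first tab is shifted by one byte, so a UTF-16 reader pairs the bytes up incorrectly from that point on.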

In my case adding Preamble to file solved my problem:

var data = Encoding.UTF8.GetBytes(csv);
var result = Encoding.UTF8.GetPreamble().Concat(data).ToArray();
return File(new MemoryStream(result), "application/octet-stream", "data.csv");
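The same preamble idea, sketched in Python for comparison: the `utf-8-sig` codec writes the three BOM bytes EF BB BF in front of the encoded text. The function name `csv_with_utf8_bom` and the sample content are made up for the example, and how the bytes are then sent to the browser depends on the web framework in use.

```python
def csv_with_utf8_bom(csv_text: str) -> bytes:
    """Encode CSV text as UTF-8 with a leading BOM (EF BB BF)."""
    # utf-8-sig prepends the BOM when encoding.
    return csv_text.encode("utf-8-sig")


data = csv_with_utf8_bom("name,city\r\nJürgen,Köln\r\n")
```

According to the question's table this helps Excel on Windows, while Excel 2011 on Mac still showed scrambled text for UTF-8 with a BOM.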

Instead of CSV, try outputting HTML with an XLS extension and the "application/excel" MIME type. I know this works on Windows, but I can't speak for MacOS.


This works for me

  1. Open the file in BBEdit or TextWrangler*.
  2. Set the file as Unicode (UTF-16 Little-Endian) (Line Endings can be Unix or Windows). Save!
  3. In Excel: Data > Get External Data > Import Text File...

Now the key point: choose Macintosh as File Origin (it should be the first choice).

This is using Excel 2011 (version 14.4.2)

*There's a little dropdown at the bottom of the window


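Solved this using Java (UTF-16LE with BOM):

String csvReportStr = getCsvReport();
// Prepend U+FEFF so that encoding emits the BOM bytes FF FE. Calling put(0, ...) and
// put(1, ...) on the encoded buffer would overwrite the first character instead.
byte[] data = ("\uFEFF" + csvReportStr).getBytes(java.nio.charset.StandardCharsets.UTF_16LE);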

Note that the CSV file should use TAB as the separator. The file can then be read on both Windows and Mac OS X.

Refer to: How do I encode/decode UTF-16LE byte arrays with a BOM?

Reference URL: https://stackoverflow.com/questions/6588068/which-encoding-opens-csv-files-correctly-with-excel-on-both-mac-and-windows