질문과 답변

Q. URI에 한국어가 들어가면 어떻게 되는가?

utf-8은 최대 1,112,064개의 문자를 표현할 수 있다.

RFC 3986에는 UTF-8에 해당하는 문자의 변환에 대해 이렇게 다루고 있다.

  When a new URI scheme defines a component that represents textual
   data consisting of characters from the Universal Character Set [UCS],
   the data should first be encoded as octets according to the UTF-8
   character encoding [STD63]; then only those octets that do not
   correspond to characters in the unreserved set should be percent-
   encoded.  For example, the character A would be represented as "A",
   the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
   as "%C3%80", and the character KATAKANA LETTER A would be represented
   as "%E3%82%A2".

먼저, 어떤 URI 컴포넌트가 UCS에 속한 textual data를 다룰 때 그것은 UTF-8을 따르는 octet으로 표현되어야 한다.
그후, 이 octet 중 url의 unreserved set에 포함되지 않는 문자만 percent-encode 된다.

한국어 변환 과정

한 을 utf-8 table에서 찾아보면 다음과 같다.
MARC-8 UCS UTF-8 CHAR NAME 6F5C65 D55C ED959C 한 Korean hangul
테이블에 따르면, 한 의 UCS code point는 D55C이다.
이는 binary로 1101 0101 0101 1100 이다.
이것을 utf-8으로 변환하면 3bytes로 표현할 수 있다.
이것을 표현하면 11101101 10010101 10011100 이다.
8bit의 묶음을 octet이라 표현하는데, 이것들을 hexadecimal로 표현하면 ED959C 이다.
이제 이것을 percent encoding해서 표현하면 %ED%95%9C 가 된다.

percent encoding

reserved character가 special character sequence를 이용해 URI에 표현될 수 있다.
예를 들어, / 는 반드시 percent encode해서 표현되어야만 하며, %2F 로 표현된다.
utf-8에 해당하는 문자들도 당연히 percent encode 되어야만 한다.

references

Q. 잘못된 charset 을 사용하면 발생할 수 있는 이슈?

대표적으로 많이 쓰이는 charset 은 UTF-8
UTF-8 은 유효하지 않은 byte 에 대해 replacement character (�) 로 변환하길 권장하는 스펙을 갖고 있음
잘못된 인코더/디코더 사용은 데이터 손실을 발생시킬 수 있다.
- encode(decode(data)) !== data 일 수 있다.

실제로 데이터 손실이 발생하는지?

0부터 255 로 매핑된 character 들에 대해 Javascript TextDecoder API 로 디코딩 했을 때 같은 글자로 변환 (replacement character 으로 추정) 되는 문자가 몇개나 되는지 알아본다.

js code

let bufferView = new Uint8Array(256);
for (let i = 0; i < 256; i++) {
  bufferView[i] = i;
}

let decodedText = new TextDecoder('utf-8').decode(bufferView);
let decodedTextMap = new Map();

for (let i = 0; i < decodedText.length; i++) {
  let charCode = decodedText.charCodeAt(i);
  let val = decodedTextMap.get(charCode);
  if (val === undefined) {
    decodedTextMap.set(charCode, 1);
  } else {
    decodedTextMap.set(charCode, val+1);
  }
}

for (key of decodedTextMap.keys()) {
  let val = decodedTextMap.get(key);
  if (val > 1) {
    console.log(`number of (charCode:${key}, char:${String.fromCharCode(key)}) is ${val}`);
  }
}

result

number of (charCode:65533, char:�) is 128

데이터 손실을 발생시키지 않는 charset 도 있는지?

인코딩 된 후 문자의 가짓수가 256개가 되는 charset 이 있는지 알아본다.

js code

let bufferView = new Uint8Array(256);
for (let i = 0; i < 256; i++) {
  bufferView[i] = i;
}

let decoding = [
  'utf-8',          'ibm866',       'iso-8859-2',     'iso-8859-3',     'iso-8859-4',
  'iso-8859-5',     'iso-8859-6',   'iso-8859-7',     'iso-8859-8',     'iso-8859-8i',
  'iso-8859-10',    'iso-8859-13',  'iso-8859-14',    'iso-8859-15',    'iso-8859-16',
  'koi8-r',         'koi8-u',       'macintosh',      'windows-874',    'windows-1250',
  'windows-1251',   'windows-1252', 'windows-1253',   'windows-1254',   'windows-1255',
  'windows-1256',   'windows-1257', 'windows-1258',   'x-mac-cyrillic', 'gbk',
  'gb18030',        'hz-gb-2312',   'big5',           'euc-jp',         'iso-2022-jp',
  'shift-jis',      'euc-kr',       'iso-2022-kr',    'utf-16be',       'utf-16le',
  'x-user-defined', 'ISO-2022-CN',  'ISO-2022-CN-ext'];

for (let dec of decoding) {
  try {
    let decodedText = new TextDecoder(dec).decode(bufferView);
    let decodedTextSet = new Set();
    for (let i = 0; i < decodedText.length; i++) {
      let charCode = decodedText.charCodeAt(i);
      decodedTextSet.add(charCode);
    }
    console.log(`[${dec}] : ${decodedTextSet.size} / ${256} can be represented.`);
  } catch (e) {
    console.log(`!! [${dec}] : not supported.`);
  }
}

result

[utf-8] : 129 / 256 can be represented.
[ibm866] : 256 / 256 can be represented.
[iso-8859-2] : 256 / 256 can be represented.
[iso-8859-3] : 250 / 256 can be represented.
[iso-8859-4] : 256 / 256 can be represented.
[iso-8859-5] : 256 / 256 can be represented.
[iso-8859-6] : 212 / 256 can be represented.
[iso-8859-7] : 254 / 256 can be represented.
[iso-8859-8] : 221 / 256 can be represented.
!! [iso-8859-8i] : not supported.
[iso-8859-10] : 256 / 256 can be represented.
[iso-8859-13] : 256 / 256 can be represented.
[iso-8859-14] : 256 / 256 can be represented.
[iso-8859-15] : 256 / 256 can be represented.
[iso-8859-16] : 256 / 256 can be represented.
[koi8-r] : 256 / 256 can be represented.
[koi8-u] : 256 / 256 can be represented.
[macintosh] : 256 / 256 can be represented.
[windows-874] : 249 / 256 can be represented.
[windows-1250] : 256 / 256 can be represented.
[windows-1251] : 256 / 256 can be represented.
[windows-1252] : 256 / 256 can be represented.
[windows-1253] : 254 / 256 can be represented.
[windows-1254] : 256 / 256 can be represented.
[windows-1255] : 247 / 256 can be represented.
[windows-1256] : 256 / 256 can be represented.
[windows-1257] : 255 / 256 can be represented.
[windows-1258] : 256 / 256 can be represented.
[x-mac-cyrillic] : 256 / 256 can be represented.
[gbk] : 193 / 256 can be represented.
[gb18030] : 192 / 256 can be represented.
!! [hz-gb-2312] : not supported.
[big5] : 177 / 256 can be represented.
[euc-jp] : 169 / 256 can be represented.
[iso-2022-jp] : 126 / 256 can be represented.
[shift-jis] : 220 / 256 can be represented.
[euc-kr] : 189 / 256 can be represented.
!! [iso-2022-kr] : not supported.
[utf-16be] : 127 / 256 can be represented.
[utf-16le] : 127 / 256 can be represented.
[x-user-defined] : 256 / 256 can be represented.
!! [ISO-2022-CN] : not supported.
!! [ISO-2022-CN-ext] : not supported.

결론

replacement character 로 변환될 수도 있구나 하고 말면 될 듯...

Q. IDN homograph attack

위키피디아 https://en.wikipedia.org/wiki/IDN_homograph_attack 에 대한 대충 번역 및 정리입니다. ref: https://no-hanja-domain.github.io/technical.html

Internationalized domain name homograph attack 국제 도메인을 이용한 homograph(한국말로 동형이의어) 공격 방법

비슷하게 생겼는데 다른 문자인 homoglyph를 이용한다.
그런데 우리 책에서 glyph면 같은 문자여야 한다고 하지 않았나...
wikipediа.org가 wikipedia.org인 것처럼 속인다. 첫 번째 а는 라틴 문자가 아니고 키릴 문자이다.
script spoofing이라고도 부른다. 스푸핑 공격이라고 하는 건 뭔가 속이는 공격의 총칭: 예를 들어 서버한테는 내가 클라인척 속이고 클라한테는 내가 서버인척 속이고.
typosquatting이랑도 비슷한데, 이건 말 그대로 typo. 치는 사람이 바보라서 잘못 치는 거임. ex. wikepidea.org

역사

컴퓨터 하는 사람들은 항상 l, 1, O, 0 같은 것들이 헷갈려 해왔다... 타자기 치던 시절에는 귀찮아서 영어 l 치는 데 1 넣기도 하고, 0이랑 O랑 헷갈려서 0에 빗금 치기 시작한 거다.
엄청 옛날 사람들도 헷갈려 했다. 아랍어 samt는 zenith로 번역됐는데 원래는 zemth가 되어야 했던 게 m하고 ni하고 헷갈려서 그랬다.
사람들은 재미로도 이런 짓을 한다. ex. ㄴr는ㄱト끔..

서버 대응

ICANN이라는 단체(또는 하위의 registry operator 단체)에서 최상위 도메인(TLD, top level domain) 정책을 정함. 여기서는 이미 어떤 TLD가 있을 때, 걔랑 헷갈리는 문자를 도메인으로 등록 못하게 해놓음.

클라이언트 대응

브라우저들은 아예 IDN을 막아 버리거나, IDNA 사이트를 차단해 버리거나, Punycode로 보여주는 방법을 사용한다. punycode는 한글 같은 것들을 유니코드로 바꿔주는 알고리즘. 예를 들어 wikipediа는 xn--wikipedi-86g로 바뀜.

파이어폭스: 최상위 도메인에서 IDNA를 막거나, 여러 언어가 섞여 있지 않은 경우면 IDN을 보여 준다. 그렇지 않으면 punycode로 바꿔서 보여 준다.
구글 크롬: 어떤 IDN이 한 가지 언어로만 되어 있고, 그 언어가 유저가 설정한 언어라면, IDN을 보여 줬다. 근데 이제는 파이어폭스랑 비슷한 알고리즘을 씀.
사파리: punycode로 보여준다.
익스플로러, 오페라: 굳이 언급하기 귀찮고 위에랑 비슷함

Previous16. 국제화 Next17. 내용 협상과 트랜스코딩

Last updated 6 years ago

hashtagQ. URI에 한국어가 들어가면 어떻게 되는가?

hashtag한국어 변환 과정

hashtagpercent encoding

hashtagreferences

hashtagQ. 잘못된 charset 을 사용하면 발생할 수 있는 이슈?

hashtag실제로 데이터 손실이 발생하는지?

hashtag데이터 손실을 발생시키지 않는 charset 도 있는지?

hashtag결론

hashtagQ. IDN homograph attack

hashtag역사

hashtag서버 대응

hashtag클라이언트 대응