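Recently I used pandas to read a Stata file and then save it back out as .dta. I read the .dta file, processed the data, and when saving the result I tried to pass in the variable_labels from the original data file. It raised an error: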
D:\anaconda\envs\tensorflow\lib\site-packages\pandas\io\stata.py in _write_variable_labels(self)
   2251             is_latin1 = all(ord(c) < 256 for c in label)
   2252             if not is_latin1:
-> 2253                 raise ValueError('Variable labels must contain only '
   2254                                  'characters that can be encoded in '
   2255                                  'Latin-1')
ValueError: Variable labels must contain only characters that can be encoded in Latin-1
So I read through the source code:
def _write_variable_labels(self):
    # Missing labels are 80 blank characters plus null termination
    blank = _pad_bytes('', 81)
    if self._variable_labels is None:
        for i in range(self.nvar):
            self._write(blank)
        return
    for col in self.data:
        if col in self._variable_labels:
            label = self._variable_labels[col]
            if len(label) > 80:
                raise ValueError('Variable labels must be 80 characters '
                                 'or fewer')
            is_latin1 = all(ord(c) < 256 for c in label)
            if not is_latin1:
                raise ValueError('Variable labels must contain only '
                                 'characters that can be encoded in '
                                 'Latin-1')
            self._write(_pad_bytes(label, 81))
        else:
            self._write(blank)
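As you can see, when writing the .dta file, the writer takes each column name of the current DataFrame and looks up the corresponding value in the _variable_labels that was passed in; that value is the label. It then checks that every character in the label has a code point below 256, so Chinese labels cannot be written: only Latin-1 is supported. Being able to read these labels but not write them back is really annoying.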
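One way to avoid the exception is to filter the labels yourself before saving, using the same two conditions as the source above (at most 80 characters, every character below code point 256). The following is only a minimal sketch: `split_writable_labels` and the sample labels are names I made up for illustration, not part of pandas.

```python
def split_writable_labels(labels, max_len=80):
    """Split labels into those the quoted check accepts and those it rejects.

    Hypothetical helper: mirrors the length and Latin-1 conditions shown
    in _write_variable_labels, but collects offenders instead of raising.
    """
    writable, rejected = {}, {}
    for col, label in labels.items():
        fits = len(label) <= max_len
        is_latin1 = all(ord(c) < 256 for c in label)
        if fits and is_latin1:
            writable[col] = label
        else:
            rejected[col] = label
    return writable, rejected


# Made-up labels standing in for the ones read from the original .dta file
labels = {
    "age": "Age in years",
    "income": "年收入",          # Chinese: outside Latin-1, would raise
    "city": "Café location",     # accented Latin-1 character: accepted
}
writable, rejected = split_writable_labels(labels)
print("can be written:", writable)
print("must be dropped or renamed:", rejected)
```

The `writable` dict can then be passed as `variable_labels=` to `DataFrame.to_stata`, and the `rejected` ones can be translated or kept somewhere else. Newer pandas releases also document a `version=118` option for `to_stata` that writes a Unicode-capable format; whether that is available depends on the pandas version installed, and I have not tested it in this environment.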