
A pitfall when reading and writing Stata files with Python

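I recently used pandas to read a Stata file and save it back as a dta file. I read the dta file, processed the data, and then, when saving, tried to write the original file's variable_labels into the new file. That raised an error: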

D:\anaconda\envs\tensorflow\lib\site-packages\pandas\io\stata.py in _write_variable_labels(self)
   2251                 is_latin1 = all(ord(c) < 256 for c in label)
   2252                 if not is_latin1:
-> 2253                     raise ValueError('Variable labels must contain only '
   2254                                      'characters that can be encoded in '
   2255                                      'Latin-1')

ValueError: Variable labels must contain only characters that can be encoded in Latin-1
           

So I read through the source code:

def _write_variable_labels(self):
    # Missing labels are 80 blank characters plus null termination
    blank = _pad_bytes('', 81)

    if self._variable_labels is None:
        for i in range(self.nvar):
            self._write(blank)
        return

    for col in self.data:
        if col in self._variable_labels:
            label = self._variable_labels[col]
            if len(label) > 80:
                raise ValueError('Variable labels must be 80 characters '
                                 'or fewer')
            is_latin1 = all(ord(c) < 256 for c in label)
            if not is_latin1:
                raise ValueError('Variable labels must contain only '
                                 'characters that can be encoded in '
                                 'Latin-1')
            self._write(_pad_bytes(label, 81))
        else:
            self._write(blank)
           

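As you can see, when writing a dta file, the writer goes through the columns of the current DataFrame and looks up each one in the _variable_labels that was passed in; the value it finds is the label. It then requires every character to satisfy ord(c) < 256, which means only Latin-1 is supported, so Chinese labels cannot be written. They can be read but not written, which is a real pain.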
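To see in advance which labels will trip this check, the same test can be run on the label dictionary before calling the writer. The following is only a minimal sketch: the helper name split_labels_by_latin1 and the sample labels are made up for illustration and are not part of pandas.

```python
def split_labels_by_latin1(labels, max_len=80):
    """Split labels into those the writer above would accept and those it would reject.

    Hypothetical helper: it repeats the two checks from _write_variable_labels
    (length <= 80 and every character below code point 256).
    """
    writable, rejected = {}, {}
    for col, label in labels.items():
        ok = len(label) <= max_len and all(ord(c) < 256 for c in label)
        if ok:
            writable[col] = label
        else:
            rejected[col] = label
    return writable, rejected


# Made-up sample labels: one plain ASCII, one Chinese.
labels = {"age": "Age in years", "income": "年收入"}
writable, rejected = split_labels_by_latin1(labels)
print(writable)  # labels that pass the Latin-1 check
print(rejected)  # labels that would raise ValueError
```

Passing only writable as variable_labels avoids the ValueError, at the cost of dropping the Chinese labels. If the installed pandas is 1.0 or newer, DataFrame.to_stata also accepts version=118, which writes the UTF-8 based dta format and should let such labels through. I have not checked that against the pandas version in the traceback above, and the resulting file needs Stata 14 or later to open.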
