A pitfall when reading and writing Stata files in Python

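Recently I used pandas to read a Stata file and then save it back as dta. I first read the dta file, processed the data, and then, when saving the result as dta, tried to write the original file's variable_labels along with it. That raised an error: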

D:\anaconda\envs\tensorflow\lib\site-packages\pandas\io\stata.py in _write_variable_labels(self)
   2251                 is_latin1 = all(ord(c) < 256 for c in label)
   2252                 if not is_latin1:
-> 2253                     raise ValueError('Variable labels must contain only '
   2254                                      'characters that can be encoded in '
   2255                                      'Latin-1')

ValueError: Variable labels must contain only characters that can be encoded in Latin-1
           

So I read through the source code:

def _write_variable_labels(self):
        # Missing labels are 80 blank characters plus null termination
        blank = _pad_bytes('', 81)

        if self._variable_labels is None:
            for i in range(self.nvar):
                self._write(blank)
            return

        for col in self.data:
            if col in self._variable_labels:
                label = self._variable_labels[col]
                if len(label) > 80:
                    raise ValueError('Variable labels must be 80 characters '
                                     'or fewer')
                is_latin1 = all(ord(c) < 256 for c in label)
                if not is_latin1:
                    raise ValueError('Variable labels must contain only '
                                     'characters that can be encoded in '
                                     'Latin-1')
                self._write(_pad_bytes(label, 81))
            else:
                self._write(blank)
           

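As the code shows, when writing a dta file pandas looks up each column of the current DataFrame in the _variable_labels that was passed in and takes the matching value as the label. A label longer than 80 characters, or one containing any character outside Latin-1 (code point 256 or above), raises a ValueError. Chinese labels therefore cannot be written, because this writer only supports Latin-1. pandas can read them but cannot write them back, which is really annoying.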
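One way to get past the error is to check the labels before writing. The sketch below repeats the two checks from the quoted _write_variable_labels (at most 80 characters, every code point below 256) and splits a label dict into the labels the writer would accept and the ones it would reject. The helper name split_labels_by_latin1 and the sample labels are made up for illustration and are not part of pandas.

```python
def split_labels_by_latin1(labels):
    """Split variable labels into (writable, rejected).

    A label is treated as writable when it passes the same two checks as the
    quoted pandas code: length <= 80 and every character has ord(c) < 256.
    This helper is hypothetical; it is not a pandas function.
    """
    writable, rejected = {}, {}
    for col, label in labels.items():
        is_latin1 = all(ord(c) < 256 for c in label)
        if len(label) <= 80 and is_latin1:
            writable[col] = label
        else:
            rejected[col] = label
    return writable, rejected


# Made-up labels standing in for what was read from the original dta file
labels = {"age": "Age in years", "income": "年收入", "city": "Città"}
writable, rejected = split_labels_by_latin1(labels)
print(writable)   # labels that pass both checks
print(rejected)   # labels that would trigger the ValueError
```

The writable dict can then be passed as variable_labels to DataFrame.to_stata, and the rejected labels can be translated or stored somewhere else. As far as I know, pandas 1.0 and later also accept version=118 in to_stata, which writes a UTF-8 dta format and may keep the Chinese labels. Treat that as an assumption and check it against your installed version.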
