Stratified Sampling in Python

2023-03-06 15:05:04

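First, let me explain what I need. The file result_33.txt contains a number of labels, each followed by whitespace and a category mark (for example, 銀行 means "bank", 信用卡 means "credit card" and 無關 means "unrelated"). It looks like this: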

中國農業銀行 銀行
招商銀行信用卡 信用卡
門窗 無關

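What I want is to randomly draw 1000 labels from each category. If a category has fewer than 1000 samples, all of them are taken. The sampled lines are then saved to another file.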
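To make the requirement concrete, here is a minimal in-memory sketch. The helper name `sample_per_category` and its parameters are hypothetical and for illustration only. It assumes every line has the form "label category" and that the whole file fits in memory.

```python
import random
from collections import defaultdict


def sample_per_category(lines, k=1000, rng=random):
    """Return up to k randomly chosen lines for each category.

    Hypothetical helper: assumes each line is "<label> <category>"
    and that all lines fit in memory.
    """
    groups = defaultdict(list)  # category -> all lines of that category
    for line in lines:
        items = line.split()
        if len(items) < 2:
            continue  # skip blank or malformed lines
        groups[items[1]].append(line)

    sampled = {}
    for category, members in groups.items():
        # random.sample raises ValueError when k exceeds the population,
        # so cap k at the category size (small categories are kept whole).
        sampled[category] = rng.sample(members, min(k, len(members)))
    return sampled
```

This is the simplest way to state the rule "k per category, or everything if there are fewer than k", but it holds every line in memory at once.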

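My code is as follows (it keeps memory usage as low as possible, but it scans the file too many times: once to count, then once more for every category):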

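import random

SAMPLE_SIZE = 1000  # maximum number of lines to draw per category

if __name__ == '__main__':
    counts = {}  # category -> number of lines in that category
    with open("result_33.txt", 'r', encoding='utf-8-sig') as r_dict, \
            open('check_result_33.txt', 'w', encoding='utf-8') as w_dict:
        # First pass: count the lines in each category (the second field).
        for line in r_dict:
            items = line.split()
            if len(items) < 2:
                continue  # skip blank or malformed lines
            counts[items[1]] = counts.get(items[1], 0) + 1
        # Then one more pass over the file for every category.
        for key, value in counts.items():
            r_dict.seek(0)  # rewind to the start of the file
            if value < SAMPLE_SIZE:
                # Fewer lines than the sample size: keep them all.
                for line in r_dict:
                    items = line.split()
                    if len(items) >= 2 and items[1] == key:
                        w_dict.write(line)
            else:
                # Give each line of this category a distinct random rank in
                # 0..value-1 and keep the lines whose rank is below SAMPLE_SIZE.
                rand_list = list(range(value))
                random.shuffle(rand_list)
                rand_list = iter(rand_list)
                for line in r_dict:
                    items = line.split()
                    if len(items) >= 2 and items[1] == key and next(rand_list) < SAMPLE_SIZE:
                        w_dict.write(line)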
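
If the repeated scans are the main concern, one possible alternative is reservoir sampling (Algorithm R) with a separate reservoir per category. This is a different technique from the code above, not a rewrite of it. The sketch below reads the input once and keeps at most `k` lines per category in memory. The function name `reservoir_sample_by_category` and its parameters are hypothetical, and it assumes the same "label category" line format.

```python
import random
from collections import defaultdict


def reservoir_sample_by_category(lines, k=1000, rng=random):
    """Single-pass stratified sample: at most k lines per category.

    Hypothetical helper using reservoir sampling (Algorithm R); it never
    holds more than k lines per category in memory.
    """
    reservoirs = defaultdict(list)  # category -> lines currently kept
    seen = defaultdict(int)         # category -> lines seen so far
    for line in lines:
        items = line.split()
        if len(items) < 2:
            continue  # skip blank or malformed lines
        category = items[1]
        seen[category] += 1
        bucket = reservoirs[category]
        if len(bucket) < k:
            bucket.append(line)  # reservoir not full yet: always keep
        else:
            # Keep the new line with probability k / seen[category],
            # overwriting a uniformly chosen slot in the reservoir.
            j = rng.randrange(seen[category])
            if j < k:
                bucket[j] = line
    return reservoirs
```

An open file object can be passed directly as `lines`, and each returned list can then be written to the output file. The trade-off is that memory use grows with `k` times the number of categories, and the sampled lines within a category are no longer guaranteed to come out in their original file order.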
