Problem description
I have a large dictionary named idf (more than 1000 entries), and I want to store all of its values() in a compressed .txt file.
Here is my code:
for key in idf:
    data = str(idf[key])
    compressed_index = zlib.compress(data.encode('ISO-8859-1'))
    with open(current_inverted_index, "ab") as my_file:
        my_file.write(compressed_index)
After compression finishes, my new .txt file is 443 MB, and its first few lines look like this:
xú??Vw∞∞4422–5000T∑R?6¨≠’Q@76??a?ò?d?
1?ò[?d??±?X?m16?≠??5%xú??;í$7D?2?úUI?+O7ê?–I∫?*??? e ?????H$????/?ˇ1W.%??R ????w???W??’r??’≥>??|Sè??3ü9???z?=_}ü??j~Fw[ˇ?????…?&/??/?3$?? ?m<O?–RüwVüOYs??ü?t?dl”‘??≥??a??ü+??T?]}???o???≈*?S”j5??'zπ,??ú}uΩy??g??UM;KM?k?2?b?…?S6z??°C?—≤Cf??‘?¨ ????zvΩ÷÷ü??–??@≈J?±?
?ê5??i3ü??≤áu-?a1?id???é(??5t?G?≈pY?>/ ???±-?π≠?pgùXBF?8≤Z?2∏??r?‘?M ?C3wY.??≤??%??I√≥?cJπ0∑?'?ê7??òM??$EP.Cèì?v^\?"h?.§O???m?cTN?A>??X??????áf??eú<R?#-)?6?%?≤??∏_‰?v?U&hM?l??5·I?4?F?`7???z???&??l{ à??–ê?5C9—ì ?<??“ó?x?&_??Qv?j?????og???4N?d&SZùwêf^5§**M???≤≥?;V"?-?g]ü??Z?]ú∏R??r ???‰ ????3>?’?X?:?v??CK??F????4:?ò?≠,?<?9'r?àπ1ê?i|∑π??∞?;
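I am trying to test my encoding, but when I decode the file I only get back the first value of the first key in the dictionary:
b"[{'AP891220-0001': {1}}, {'AP891220-0034': {512}}, {'AP891220-0073': {311}}, {'AP891220-0078': {231}}, {'AP891220-0079': {137}}]"
Here is my decoding code: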
f = open('inverted_indexes/id_1.txt', 'rb')
decompressed_data = zlib.decompress(f.read())
print(decompressed_data)
I am not sure what the problem is, or why I only decode a small part of the .txt file instead of all of its content.
Answer 1
Use a serialization library such as pickle (unsafe with untrusted data) or json to compress the whole dictionary at once:
# Option 1: pickle
import zlib
import pickle

index = 'index.txt'
idf = dict(zip('abcdefghijklmnop', range(16)))

# Serialize the whole dictionary, then compress it as a single zlib stream.
compressed_index = zlib.compress(pickle.dumps(idf))
with open(index, 'wb') as my_file:
    my_file.write(compressed_index)

# Read the file back, decompress it, and rebuild the dictionary.
with open(index, 'rb') as f:
    decompressed_data = zlib.decompress(f.read())
print(pickle.loads(decompressed_data))

# Option 2: json
import zlib
import json

index = 'index.txt'
idf = dict(zip('abcdefghijklmnop', range(16)))

# json.dumps returns a str, so encode it to bytes before compressing.
compressed_index = zlib.compress(json.dumps(idf).encode())
with open(index, 'wb') as my_file:
    my_file.write(compressed_index)

# Decompress, decode the bytes back to str, and parse the JSON.
with open(index, 'rb') as f:
    decompressed_data = zlib.decompress(f.read()).decode()
print(json.loads(decompressed_data))
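
As for why the original decoder returned only one value: the loop in the question appends a separate zlib stream to the file for every key, and zlib.decompress stops at the end of the first stream it finds. If a file in that layout already exists, the remaining streams can probably still be recovered with zlib.decompressobj, whose unused_data attribute holds whatever bytes follow the stream that was just decoded. The following is only a minimal sketch under that assumption; the helper name decompress_all and the two sample values are made up for illustration, and for a 443 MB file you may prefer to process the data in chunks rather than in one bytes object.

```python
import zlib


def decompress_all(blob: bytes) -> list:
    """Decompress every zlib stream concatenated in `blob` (hypothetical helper)."""
    chunks = []
    while blob:
        d = zlib.decompressobj()
        # Decode one complete stream; anything after it lands in unused_data.
        chunks.append(d.decompress(blob) + d.flush())
        blob = d.unused_data
    return chunks


# Rebuild the layout from the question: one compressed stream per value,
# written back to back. The sample values are invented for this sketch.
values = ["[{'AP891220-0001': {1}}]", "[{'AP891220-0034': {512}}]"]
blob = b"".join(zlib.compress(v.encode("ISO-8859-1")) for v in values)

for chunk in decompress_all(blob):
    print(chunk.decode("ISO-8859-1"))
```

Each recovered chunk is the str() of one value, so it would still need to be parsed (for example with ast.literal_eval) to get the original objects back, which is one more reason to serialize the whole dictionary once as shown above.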