当前位置: 代码迷 >> python >> 如何切片json文件,仅提取部分字段
  详细解决方案

如何切片json文件,仅提取部分字段

热度:54   发布时间:2023-06-13 20:19:53.0

我正在尝试切片json文件,该文件如下所示:

{"price": 17.95, "categories": [["Musical Instruments", "Instrument Accessories", "General Accessories", "Sheet Music Folders"]], "imUrl": "http://ecx.images-amazon.com/images/I/41EpRmh8MEL._SY300_.jpg", "title": "Six Sonatas For Two Flutes Or Violins, Volume 2 (#4-6)", "salesRank": {"Musical Instruments": 207315}, "asin": "0006428320"}
{"description": "Composer: J.S. Bach.Peters Edition.For two violins and pianos.", "related": {"also_viewed": ["B0058DK7RA"], "buy_after_viewing": ["B0058DK7RA"]}, "categories": [["Musical Instruments"]], "brand": "", "imUrl": "http://ecx.images-amazon.com/images/I/41m6ygCqc8L._SY300_.jpg", "title": "Double Concerto in D Minor By Johann Sebastian Bach. Edited By David Oistrach. For Violin I, Violin Ii and Piano Accompaniment. Urtext. Baroque. Medium. Set of Performance Parts. Solo Parts, Piano Reduction and Introductory Text. BWV 1043.", "salesRank": {"Musical Instruments": 94593}, "asin": "0014072149", "price": 18.77}
{"asin": "0041291905", "categories": [["Musical Instruments", "Instrument Accessories", "General Accessories", "Sheet Music Folders"]], "imUrl": "http://ecx.images-amazon.com/images/I/41maAqSO9hL._SY300_.jpg", "title": "Hal Leonard Vivaldi Four Seasons for Piano (Original Italian Text)", "salesRank": {"Musical Instruments": 222972}, "description": "Vivaldi's famous set of four violin concertos certainly ranks among the all-time top ten classical favorites. Features include an introduction about the history of The Four Seasons and Vivaldi's original vivid Italian score markings. A must for classical purists."}

您可以看到这些字段并未严格排在所有行中,我只需要部分字段。 所以我写了这段代码:

import json, csv

infile = open("sample_output.strict", "r")
outfile = open("output.csv", "w")
writer = csv.writer(outfile)

fileds = ["asin","price"]    
for product in json.loads(infile.read()):
    line = []
    for f in fields:
        if product.has_key(f):
            line.append(product[f])
        else:
            line.append("")
        writer.write(line)

我收到以下错误消息:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-3e335b184eea> in <module>()
      6 
      7 fileds = ["asin","price"]
----> 8 for product in json.loads(infile.read()):
      9     line = []
     10     for f in fields:

C:\Anaconda3\lib\json\__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    316             parse_int is None and parse_float is None and
    317             parse_constant is None and object_pairs_hook is None and not kw):
--> 318         return _default_decoder.decode(s)
    319     if cls is None:
    320         cls = JSONDecoder

C:\Anaconda3\lib\json\decoder.py in decode(self, s, _w)
    344         end = _w(s, end).end()
    345         if end != len(s):
--> 346             raise ValueError(errmsg("Extra data", s, end, len(s)))
    347         return obj
    348 

ValueError: Extra data: line 2 column 1 - line 3 column 617 (char 339 - 1581) 

您所拥有的是json行,而不是单个json文档。 更改程序以读取每一行并将其转换为json,然后以这种方式查看每个文档。 这实际上很常见,我一直以这种格式接收要加载的数据。

如果最终以任何方式处理大型文件,逐行执行此操作将节省大量内存。

import json, csv

with open("sample_output.strict", "r") as infile:
    with open("output.csv", "w") as outfile:
        writer = csv.writer(outfile)

        fields = ["asin","price"]  
        for json_line in infile:  
            product = json.loads(json_line)
            line = []
            for f in fields:
                if product.has_key(f):
                    line.append(product[f])
                else:
                    line.append("")
            writer.writerow(line)

您输入的json文件格式错误。 这就是您看到此错误的原因。 简而言之,单个文件中不能有多个JSON“对象”。 但是,在您的情况下,有3个哈希值。 一种解决方案是将它们包含在顶层列表中,如下所示:

[
    {"price": 17.95, "categories": [["Musical Instruments", "Instrument Accessories", "General Accessories", "Sheet Music Folders"]], "imUrl": "http://ecx.images-amazon.com/images/I/41EpRmh8MEL._SY300_.jpg", "title": "Six Sonatas For Two Flutes Or Violins, Volume 2 (#4-6)", "salesRank": {"Musical Instruments": 207315}, "asin": "0006428320"},
    {"description": "Composer: J.S. Bach.Peters Edition.For two violins and pianos.", "related": {"also_viewed": ["B0058DK7RA"], "buy_after_viewing": ["B0058DK7RA"]}, "categories": [["Musical Instruments"]], "brand": "", "imUrl": "http://ecx.images-amazon.com/images/I/41m6ygCqc8L._SY300_.jpg", "title": "Double Concerto in D Minor By Johann Sebastian Bach. Edited By David Oistrach. For Violin I, Violin Ii and Piano Accompaniment. Urtext. Baroque. Medium. Set of Performance Parts. Solo Parts, Piano Reduction and Introductory Text. BWV 1043.", "salesRank": {"Musical Instruments": 94593}, "asin": "0014072149", "price": 18.77},
    {"asin": "0041291905", "categories": [["Musical Instruments", "Instrument Accessories", "General Accessories", "Sheet Music Folders"]], "imUrl": "http://ecx.images-amazon.com/images/I/41maAqSO9hL._SY300_.jpg", "title": "Hal Leonard Vivaldi Four Seasons for Piano (Original Italian Text)", "salesRank": {"Musical Instruments": 222972}, "description": "Vivaldi's famous set of four violin concertos certainly ranks among the all-time top ten classical favorites. Features include an introduction about the history of The Four Seasons and Vivaldi's original vivid Italian score markings. A must for classical purists."}
]

然后,您可以使用以下代码片:

import json, csv
infile = open("sample_output.strict", "r")
jsondata = json.loads(infile.read())
outfile = open("output.csv", "w")
writer = csv.writer(outfile)
fields = ["asin","price"]
for product in jsondata:
    line = []
    for f in fields:
        if f in product:
            line.append(product)
            break   # I assume you need to print only once per match!?
        else:
            line.append("")
    writer.write(line)

我不明白您要使用csv输出做什么,因此我只是照原样复制了它,以演示此修复程序。

  相关解决方案