How do I decode web-scraped data?

Popularity: 93   Published: 2023-07-16 09:41:00.0
# Python 2 (urllib2 does not exist in Python 3)
import urllib2
from bs4 import BeautifulSoup

url = 'https://navbharattimes.indiatimes.com/movie-masti/movie-review/village-rockstars-movie-review-in-hindi/moviereview/65997258.cms'

# The original snippet used `headers` without defining it; this value is an assumption.
headers = {'User-Agent': 'Mozilla/5.0'}

# Fetch the page
request = urllib2.Request(url, None, headers)
response = urllib2.urlopen(request)
html_read = response.read()
response.close()

# Parse the HTML and collect every <div class="Normal"> element
html_soup = BeautifulSoup(html_read, 'lxml')
review = html_soup.findAll('div', class_='Normal')
review  # displaying the list shows the escaped repr of each element

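After running the web-scraping code above, I get the following codes instead of the original text output. My question is how to convert these ASCII codes to text.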

This is the scraped data:

\u0930\u0947\u0923\u0941\u0915\u093e \u0935\u094d\u092f\u0935\u0939\u093e\u0930\u0947<br/>\u0915\u0939\u093e\u0928\u0940:</strong> \u0905\u0938\u092e \u0915\u0947 \u090f\u0915 \u0916\u0942\u092c\u0938\u0942\u0930\u0924 \u0917\u093e\u0902\u0935 \u092e\u0947\u0902 \u0930\u0939\u0928\u0947 \u0935\u093e\u0932\u0940 \u0927\u0941\u0928\u0942 \u0915\u094b \u092a\u0947\u0921\u093c \u092a\u0930 \u091a\u0922\u093c\u0928\u093e, \u0932\u0921\u093c\u0915\u094b\u0902 \u0915\u0947 \u0938\u093e\u0925 \u0916\u0947\u0932\u0928\u093e \u0914\u0930 \u0905\u092a\u0928\u093e \u0925\u0930\u092e\u093e\u0915\u0949\u0932 \u0915\u093e \u0917\u093f\u091f\u093e\u0930 \u092b\u094d\u0932\u0949\u0928\u094d\u091f \u0915\u0930\u0928\u093e \u0915\u093e\u092b\u0940 \u0905\u091a\u094d\u091b\u093e \u0932\u0917\u0924\u093e \u0939\u0948\u0964 \u0935\u0939 \u0918\u0930 \u0915\u0947 \u0915\u093e\u092e\u094b\u0902 \u092e\u0947\u0902 \u0905\u092a\u0928\u0940 \u0935\u093f\u0927\u0935\u093e \u092e\u093e\u0902 \u0915\u093e \u0939\u093e\u0925 \u092d\u0940 \u092c\u091f\u093e\u0924\u0940 \u0939\u0948\u0964 \u0909\u0938\u0915\u093e \u0938\u092a\u0928\u093e \u0939\u0948 \u0915\u093f \u090f\u0915 \u0926\u093f\u0928 \u0909\u0938\u0915\u0947 \u092a\u093e\u0938 \u0905\u0938\u0932 \u0917\u093f\u091f\u093e\u0930 \u0939\u094b\u0964 \u0915\u094d\u092f\u093e \u0909\u0938\u0915\u093e \u092f\u0939 \u0938\u092a\u0928\u093e \u092a\u0942\u0930\u093e \u0939\u094b \u092a\u093e\u090f\u0917\u093e

You have Unicode. To view the retrieved content, try:

for div in review:
    print div.get_text()  # printing the list itself shows each element's escaped repr

This will produce something like:

रेणुका व्यवहारेकहानी: असम के एक खूबसूरत गांव में रहने वाली धुनू को पेड़ पर चढ़ना, लड़कों के साथ खेलना और अपना थरमाकॉल का गिटार फ्लॉन्ट करना काफी अच्छा लगता है। वह घर के कामों में अपनी विधवा मां का हाथ भी बटाती है। उसका सपना है कि एक दिन उसके पास असल गिटार हो। क्या उसका यह सपना पूरा हो पाएगा

This is Unicode data, not ASCII, and it must be encoded and displayed correctly.
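
To make this concrete, here is a minimal Python 3 sketch. The sample word is taken from the scraped text, and the variable names are only illustrative. It shows that the `\uXXXX` form is an escaped view of the same string, and one way literal backslash sequences (for example, copied out of a log) could be turned back into characters.

```python
# The word "Assam" from the scraped text, written with escapes: three characters.
text = '\u0905\u0938\u092e'

# ascii() escapes non-ASCII characters, much like the repr shown in the question.
escaped_view = ascii(text)

# If the escapes arrive as literal backslash sequences (a raw string here),
# the unicode_escape codec converts them back into real characters.
literal = r'\u0905\u0938\u092e'
decoded = literal.encode('ascii').decode('unicode_escape')

print(escaped_view)      # ASCII-only view of the string
print(decoded == text)   # True
```

In other words, nothing needs to be "converted" when the string is already a Unicode object; only text that contains literal backslashes needs the decoding step.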

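  • Python 3 has better Unicode support; if you are not using it yet, consider switching.

  • The terminal you run it in must also be able to handle and display Unicode data; otherwise you will see boxes where the characters should be.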
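
The terminal point can be checked from code. The following is a rough Python 3 sketch, not a definitive recipe: the helper names and the output file name are hypothetical. It asks whether the terminal's reported encoding can represent a sample word, and writes the text to a UTF-8 file so it can be inspected in an editor regardless of the terminal.

```python
import io
import os
import sys
import tempfile

def stdout_can_show(text):
    """Return True if text is encodable with the terminal's reported encoding."""
    encoding = sys.stdout.encoding or 'ascii'
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def save_utf8(path, text):
    """Write text to a UTF-8 file, bypassing the terminal entirely."""
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(text)

def load_utf8(path):
    """Read the file back as text, decoding it as UTF-8."""
    with io.open(path, 'r', encoding='utf-8') as handle:
        return handle.read()

# "guitar" from the scraped text, written with escapes
sample = '\u0917\u093f\u091f\u093e\u0930'

# Hypothetical output location, used only for this illustration
out_path = os.path.join(tempfile.gettempdir(), 'review_sample.txt')
save_utf8(out_path, sample)
print('terminal can show sample:', stdout_can_show(sample))
```

An encodable string can still render as boxes if no suitable font is installed, so this check covers only the encoding half of the problem.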


Edit: The text is Hindi; to display it correctly, a font that covers it must also be installed on your system.
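
Hindi is written in the Devanagari script, so that is the font coverage to look for. As a small hedged sketch (the helper `script_names` is hypothetical), Python's standard `unicodedata` module can confirm which script the escaped code points belong to:

```python
import unicodedata

def script_names(text):
    """Return the first word of each character's Unicode name (a rough script label)."""
    return {unicodedata.name(ch, 'UNKNOWN').split(' ')[0]
            for ch in text if not ch.isspace()}

# "Assam" from the scraped text, written with escapes
sample = '\u0905\u0938\u092e'

print(unicodedata.name(sample[0]))  # DEVANAGARI LETTER A
print(script_names(sample))         # {'DEVANAGARI'}
```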


Edit: Here is my attempt at scraping the same content using Python 3:

In [1]: import requests
   ...: from lxml import etree
   ...:
   ...: url = 'https://navbharattimes.indiatimes.com/movie-masti/movie-review/village-rockstars-movie-review-in-hindi/moviereview/65997258.cms'
   ...:
   ...: r = requests.get(url)
   ...: tree = etree.HTML(r.text)
   ...:
   ...: all_divs = tree.xpath('//div[@class="Normal"]//text()')
   ...:
   ...: text = ' '.join([i for i in all_divs if i.strip() != ""])

In [2]: text                                             
Out[2]: "?????? ???????? ?????:  ??? ?? ?? ??????? ???? ??? ???? ???? ???? ?? ???? ?? ?????, ?????? ?? ??? ????? ?? ???? ??????? ?? ????? ??????? ???? ???? ????? ???? ??? ?? ?? ?? ????? ??? ???? ????? ??? ?? ??? ?? ????? ??? ???? ???? ?? ?? ?? ??? ???? ??? ??? ????? ??? ???? ???? ?? ???? ???? ?? ??????\n ??????:  ????-?????????-????????  ???? ???  ?? ???? ????? ' ????? ?????????? ' ???? ?? ?? ?? ????? 2019 ?? ??? ????? ????? ???????? ??????? ?? ??? ???? ?? ??? ?? ?????, ??????? ?? ????????? ?? ????? ?????? ?? ????? ????? ?? ?? ???? ???? ?? ?????? ???? ???????? ?? ?????? ??? ?? ?? ??? ?? ???? ?? ???? ?? ?? ???? ?? ??????? ????? ?? ??? ???? ??? ??? ????? ?? ???? ???? ???? ???????? ???? ???? ???, ???? ???? ???? ?? ??????? ?? ???? ?? ????? ???? ?? ???? ????? ??? ??? ???????? ?? ???? ?????? ?? ????? ???? ??? ??? ???? ??? ??? ?? ??????? ?? ???? ?? ????? ?? ?????? ???????????? ?? ?????????? ?? ???? ???? ???? ???? ?? ????? ?? ?????? ??? ?? ???? ???? \n ?? ??? ?????? ?? ???? ???? ??????? ?? ????? ?????? ??? ?? ??? ?????? ???? ???? ????????? ?? ?????? ?? ??? ???? ???? ?? ??? ????? ?? ??? ??????? ???? ??? ?? ???? ?? ?? ????? ??? ??????? ?? ????? ??? ???? ??? \n ?? ????? ????? ?? ???? ???? ???? ???????? ?? ??????? ??? ???? ??? ??????? ?? ????? ??? ?? ?? ??????? ?? ?????? ?? ?? ??? ?? ???? ???? ????? ??? \n ?????? ?????? ?? ??????? ?? ????? ???? ?? ???? ?? ???? ????? ?? ?????? ?????? ?????? ?? ???? ?? ????? ????? ?? ????? ??? ???? ???? ??? ??? ???? ?? ??? ???? ???? ?? ?????? ???? ?? ??? ?? ???? ??? ?? ???? ?? ??? ???? ?? ?? ??? '????? ?? ???' ???? ?? ????? ???? ???? ???? ?? ????? ?? ??? ????? ?? ??????? ?????? ???????, ??????? ?? ?????? ???????? ?? ????? ??? ????? ?? ??? ?? ???? ????? ?? ??? ?? ????? ????? ??? \n ????? ?????????? ???? ?? ?? ??? ??? ?????? ?? ?? ?? ???????? ?? ???? ??? ?? ?? ???? ????? ?? ???? ??? ?? ??????? ?? ????? ?? ?????? ?? ??????? ?? ????? ?? ????? ?? ????? ???\n ??????: X"
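
The session above relies on `r.text`, which means requests has to pick the right character encoding for the response body. As a hedged illustration using only the standard library, the sketch below uses a byte string as a stand-in for a response body (it is not fetched from the site, and UTF-8 is an assumption). It shows the same bytes decoding correctly as UTF-8 and turning into mojibake under the wrong codec. With requests, the corresponding pieces are `r.content` (raw bytes) and `r.encoding` (the codec used to build `r.text`).

```python
# Stand-in for a response body: "Assam" from the review, encoded as UTF-8 (assumed).
body = '\u0905\u0938\u092e'.encode('utf-8')

correct = body.decode('utf-8')      # the original three characters
wrong = body.decode('latin-1')      # nine unrelated characters (mojibake)

# Reversing the wrong decode and applying the right one repairs the text.
repaired = wrong.encode('latin-1').decode('utf-8')

print(len(body), len(correct), len(wrong))  # 9 3 9
print(repaired == correct)                  # True
```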