Problem description
# Python 2 code (urllib2 does not exist in Python 3)
import bs4
from bs4 import BeautifulSoup as soup
import requests
import json
import pandas as pd
from urllib2 import urlopen as uo
from urllib2 import Request as ur
import numpy as np
from pandas import ExcelWriter

url = 'https://navbharattimes.indiatimes.com/movie-masti/movie-review/village-rockstars-movie-review-in-hindi/moviereview/65997258.cms'
# NOTE: `headers` was not defined in the original snippet; a browser-like
# User-Agent is assumed here so that the request can be built.
headers = {'User-Agent': 'Mozilla/5.0'}
request = ur(url, None, headers)
uC = uo(request)
html_read = uC.read()
uC.close()
html_soup = soup(html_read, 'lxml')
review = html_soup.findAll('div', class_='Normal')
review
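After running the web-scraping code above, I get the following codes instead of the original text. My question is how to convert these ASCII codes into text.
This is the scraped data: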
\u0930\u0947\u0923\u0941\u0915\u093e \u0935\u094d\u092f\u0935\u0939\u093e\u0930\u0947<br/>\u0915\u0939\u093e\u0928\u0940:</strong> \u0905\u0938\u092e \u0915\u0947 \u090f\u0915 \u0916\u0942\u092c\u0938\u0942\u0930\u0924 \u0917\u093e\u0902\u0935 \u092e\u0947\u0902 \u0930\u0939\u0928\u0947 \u0935\u093e\u0932\u0940 \u0927\u0941\u0928\u0942 \u0915\u094b \u092a\u0947\u0921\u093c \u092a\u0930 \u091a\u0922\u093c\u0928\u093e, \u0932\u0921\u093c\u0915\u094b\u0902 \u0915\u0947 \u0938\u093e\u0925 \u0916\u0947\u0932\u0928\u093e \u0914\u0930 \u0905\u092a\u0928\u093e \u0925\u0930\u092e\u093e\u0915\u0949\u0932 \u0915\u093e \u0917\u093f\u091f\u093e\u0930 \u092b\u094d\u0932\u0949\u0928\u094d\u091f \u0915\u0930\u0928\u093e \u0915\u093e\u092b\u0940 \u0905\u091a\u094d\u091b\u093e \u0932\u0917\u0924\u093e \u0939\u0948\u0964 \u0935\u0939 \u0918\u0930 \u0915\u0947 \u0915\u093e\u092e\u094b\u0902 \u092e\u0947\u0902 \u0905\u092a\u0928\u0940 \u0935\u093f\u0927\u0935\u093e \u092e\u093e\u0902 \u0915\u093e \u0939\u093e\u0925 \u092d\u0940 \u092c\u091f\u093e\u0924\u0940 \u0939\u0948\u0964 \u0909\u0938\u0915\u093e \u0938\u092a\u0928\u093e \u0939\u0948 \u0915\u093f \u090f\u0915 \u0926\u093f\u0928 \u0909\u0938\u0915\u0947 \u092a\u093e\u0938 \u0905\u0938\u0932 \u0917\u093f\u091f\u093e\u0930 \u0939\u094b\u0964 \u0915\u094d\u092f\u093e \u0909\u0938\u0915\u093e \u092f\u0939 \u0938\u092a\u0928\u093e \u092a\u0942\u0930\u093e \u0939\u094b \u092a\u093e\u090f\u0917\u093e
Answer 1
What you have is Unicode.
To see the retrieved content, try:
print review
This will produce something like:
रेणुका व्यवहारेकहानी: असम के एक खूबसूरत गांव में रहने वाली धुनू को पेड़ पर चढ़ना, लड़कों के साथ खेलना और अपना थरमाकॉल का गिटार फ्लॉन्ट करना काफी अच्छा लगता है। वह घर के कामों में अपनी विधवा मां का हाथ भी बटाती है। उसका सपना है कि एक दिन उसके पास असल गिटार हो। क्या उसका यह सपना पूरा हो पाएगा
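To illustrate the point, here is a minimal Python 3 sketch. The sample string is the first word of the scraped data from the question; the variable names are only for illustration. It shows that `\uXXXX` sequences are just the escaped way of displaying non-ASCII characters, and, for the separate case where the backslashes are literally part of the text (for example in a JSON dump), one possible way to turn them back into characters.

```python
import json

# A normal Python string; the escapes are only a way of typing the characters.
word = "\u0930\u0947\u0923\u0941\u0915\u093e"

# ascii() returns the escaped representation. The string itself is unchanged
# and still holds 6 Devanagari code points.
escaped_view = ascii(word)
print(escaped_view)
print(len(word))

# Separate case: the text really contains a backslash followed by "u0930" etc.
# Parsing it as a JSON string literal is one way to decode such sequences.
literal = r"\u0930\u0947\u0923\u0941\u0915\u093e"
decoded = json.loads('"' + literal + '"')
print(decoded == word)
```

Printing `word` directly, rather than its representation, shows the Hindi text as long as the terminal can display it.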
Answer 2
This is Unicode data, not ASCII, and it must be encoded and displayed correctly.
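As a rough sketch of what "encoded correctly" can mean in Python 3: text is kept as `str` inside the program and encoded to UTF-8 only at the boundary, for example when writing a file. The file name `review.txt` and the sample word are assumptions made for illustration.

```python
import os
import tempfile

# Sample word "कहानी", written with escapes so the source file stays ASCII.
text = "\u0915\u0939\u093e\u0928\u0940"

# Encoding turns the 5 code points into UTF-8 bytes
# (Devanagari characters take 3 bytes each in UTF-8).
utf8_bytes = text.encode("utf-8")

# Write and read back with an explicit encoding instead of relying on
# the platform default. The file name is hypothetical.
path = os.path.join(tempfile.gettempdir(), "review.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

print(len(utf8_bytes), round_trip == text)
```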
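Python 3 has better Unicode support; if you are not using it yet, consider switching.
The terminal you run it in must also be able to handle and display Unicode data; otherwise you will see boxes where the characters should be.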
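One way to check this from Python 3 is to look at the encoding the interpreter uses for standard output. The sketch below is only illustrative, and the helper name `can_display` is hypothetical. It also shows how characters degrade to `?` when the target encoding cannot represent them, which is a common source of output that consists of rows of question marks.

```python
import sys

def can_display(text, encoding):
    """Return True if `text` can be encoded with `encoding` without loss."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# First word of the scraped review, written with escapes.
sample = "\u0930\u0947\u0923\u0941\u0915\u093e"

# Encoding reported for stdout; fall back to ASCII if none is reported.
stdout_encoding = sys.stdout.encoding or "ascii"
print("stdout encoding:", stdout_encoding)
print("can display sample:", can_display(sample, stdout_encoding))

# With errors="replace", every unencodable character becomes "?".
degraded = sample.encode("ascii", errors="replace")
print(degraded)
```

Note that this only checks the encoding; whether the glyphs actually render still depends on the terminal and the installed fonts.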
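Edit: The text is Hindi, so a font that covers the Devanagari script must also be installed on your system for it to display correctly.
Edit: Here is my attempt at scraping the same content with Python 3: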
In [1]: import requests
   ...: from lxml import etree
   ...:
   ...: url = 'https://navbharattimes.indiatimes.com/movie-masti/movie-review/village-rockstars-movie-review-in-hindi/moviereview/65997258.cms'
   ...:
   ...: r = requests.get(url)
   ...: tree = etree.HTML(r.text)
   ...:
   ...: all_divs = tree.xpath('//div[@class="Normal"]//text()')
   ...:
   ...: text = ' '.join([i for i in all_divs if i.strip() != ""])
   ...:
In [2]: text
Out[2]: "?????? ???????? ?????: ??? ?? ?? ??????? ???? ??? ???? ???? ???? ?? ???? ?? ?????, ?????? ?? ??? ????? ?? ???? ??????? ?? ????? ??????? ???? ???? ????? ???? ??? ?? ?? ?? ????? ??? ???? ????? ??? ?? ??? ?? ????? ??? ???? ???? ?? ?? ?? ??? ???? ??? ??? ????? ??? ???? ???? ?? ???? ???? ?? ??????\n ??????: ????-?????????-???????? ???? ??? ?? ???? ????? ' ????? ?????????? ' ???? ?? ?? ?? ????? 2019 ?? ??? ????? ????? ???????? ??????? ?? ??? ???? ?? ??? ?? ?????, ??????? ?? ????????? ?? ????? ?????? ?? ????? ????? ?? ?? ???? ???? ?? ?????? ???? ???????? ?? ?????? ??? ?? ?? ??? ?? ???? ?? ???? ?? ?? ???? ?? ??????? ????? ?? ??? ???? ??? ??? ????? ?? ???? ???? ???? ???????? ???? ???? ???, ???? ???? ???? ?? ??????? ?? ???? ?? ????? ???? ?? ???? ????? ??? ??? ???????? ?? ???? ?????? ?? ????? ???? ??? ??? ???? ??? ??? ?? ??????? ?? ???? ?? ????? ?? ?????? ???????????? ?? ?????????? ?? ???? ???? ???? ???? ?? ????? ?? ?????? ??? ?? ???? ???? \n ?? ??? ?????? ?? ???? ???? ??????? ?? ????? ?????? ??? ?? ??? ?????? ???? ???? ????????? ?? ?????? ?? ??? ???? ???? ?? ??? ????? ?? ??? ??????? ???? ??? ?? ???? ?? ?? ????? ??? ??????? ?? ????? ??? ???? ??? \n ?? ????? ????? ?? ???? ???? ???? ???????? ?? ??????? ??? ???? ??? ??????? ?? ????? ??? ?? ?? ??????? ?? ?????? ?? ?? ??? ?? ???? ???? ????? ??? \n ?????? ?????? ?? ??????? ?? ????? ???? ?? ???? ?? ???? ????? ?? ?????? ?????? ?????? ?? ???? ?? ????? ????? ?? ????? ??? ???? ???? ??? ??? ???? ?? ??? ???? ???? ?? ?????? ???? ?? ??? ?? ???? ??? ?? ???? ?? ??? ???? ?? ?? ??? '????? ?? ???' ???? ?? ????? ???? ???? ???? ?? ????? ?? ??? ????? ?? ??????? ?????? ???????, ??????? ?? ?????? ???????? ?? ????? ??? ????? ?? ??? ?? ???? ????? ?? ??? ?? ????? ????? ??? \n ????? ?????????? ???? ?? ?? ??? ??? ?????? ?? ?? ?? ???????? ?? ???? ??? ?? ?? ???? ????? ?? ???? ??? ?? ??????? ?? ????? ?? ?????? ?? ??????? ?? ????? ?? ????? ?? ????? ???\n ??????: X"
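The session above relies on `r.text`, which `requests` produces by decoding the raw response bytes with the encoding stored in `r.encoding`. The following standard-library sketch illustrates why that encoding matters; the HTML fragment is a made-up sample, not the real page. The same bytes give readable text when decoded as UTF-8 and garbage when decoded with the wrong codec.

```python
# Bytes as a server might send them: a UTF-8 encoded HTML fragment (made-up sample).
hindi = "\u0915\u0939\u093e\u0928\u0940"
raw = ('<div class="Normal">' + hindi + '</div>').encode("utf-8")

# Decoding with the correct codec recovers the original characters.
right = raw.decode("utf-8")

# Decoding with the wrong codec never fails for latin-1, but each byte
# becomes its own character, so the Hindi text turns into mojibake.
wrong = raw.decode("latin-1")

print(hindi in right)         # True
print(hindi in wrong)         # False
print(len(right), len(wrong))
```

With `requests`, if the guessed encoding turns out to be wrong, assigning `r.encoding = 'utf-8'` before reading `r.text` is the usual remedy, assuming the page really is UTF-8.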