Strip HTML from strings in Python


prostudy 2022. 10. 3. 15:57


from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
  print line

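When printing a line from an HTML file, I'm trying to find a way to show only the contents of each HTML element and not the formatting itself. If it finds '<a href="whatever.example">some text</a>', it should print only 'some text'; '<b>hello</b>' should print 'hello', and so on. How would one go about doing this?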

I have always used this function to strip HTML tags, as it requires only the Python stdlib.

For Python 3:

from io import StringIO
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs= True
        self.text = StringIO()
    def handle_data(self, d):
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
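One caveat worth checking with the Python 3 version: when only feed() is called, the parser may hold back trailing text that contains a bare & until close() is called, so text at the very end of the input can go missing. Below is a minimal sketch of the same idea with an explicit close(). TextOnly and strip_tags_closed are hypothetical names used only for illustration; they are not part of the answer above.

```python
from io import StringIO
from html.parser import HTMLParser

class TextOnly(HTMLParser):  # hypothetical name, same idea as MLStripper
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, d):
        # Called with text nodes; entities are already decoded here
        self.text.write(d)

def strip_tags_closed(html):
    s = TextOnly()
    s.feed(html)
    s.close()  # flush any text the parser is still buffering
    return s.text.getvalue()

print(strip_tags_closed('<b>Hello</b> &amp; <i>world</i>'))
print(strip_tags_closed('<p>AT&T'))
```

Whether the extra close() matters depends on your input; it is cheap to add if the text may end in something like "AT&T".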

For Python 2:

from HTMLParser import HTMLParser
from StringIO import StringIO

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.text = StringIO()
    def handle_data(self, d):
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

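If you need to strip HTML tags to do text processing, a simple regular expression will do. Do not use this if you are looking to sanitize user-generated HTML to prevent XSS attacks. It is not a secure way to remove all <script> tags or tracking <img>s. The following regular expression will fairly reliably strip most HTML tags: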

import re

re.sub('<[^<]+?>', '', text)

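For those who don't understand regex: this searches for a string <...> whose inner content is one or more (+) characters that are not a <. The ? means that it will match the smallest string it can find. For example, given <p>Hello</p>, it will match <p> and </p> separately. A greedy pattern such as <.+> would instead match the entire string <p>Hello</p>.

If a non-tag < appears in the HTML (e.g. 2 < 3), it should be written as an escape sequence &... anyway, so the ^< may be unnecessary.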
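To make the greedy versus non-greedy difference concrete, here is a small sketch using only the re module. The sample strings are made up for illustration; the last call shows a case where the regex removes text that was never a tag.

```python
import re

sample = '<p>Hello</p>'

greedy = re.sub(r'<.+>', '', sample)        # one match spans the whole string
lazy = re.sub(r'<.+?>', '', sample)         # <p> and </p> matched separately
no_lt = re.sub(r'<[^<]+?>', '', sample)     # the pattern from the answer

# A lone "<" that is not followed by a ">" before the next "<" survives
kept = re.sub(r'<[^<]+?>', '', 'if 2 < 3 then <b>ok</b>')

# ...but a "<" and a later ">" in plain text are treated as one tag
eaten = re.sub(r'<[^<]+?>', '', 'a < b and c > d')

print(repr(greedy), repr(lazy), repr(no_lt))
print(repr(kept))
print(repr(eaten))
```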

You can use BeautifulSoup's get_text() feature.

from bs4 import BeautifulSoup

html_str = '''
<td><a href="http://www.fakewebsite.example">Please can you strip me?</a>
<br/><a href="http://www.fakewebsite.example">I am waiting....</a>
</td>
'''
soup = BeautifulSoup(html_str)

print(soup.get_text())
#or via attribute of Soup Object: print(soup.text)

It is advisable to specify the parser explicitly, for example BeautifulSoup(html_str, features="html.parser"), so that the output is reproducible.

Short version!

import re
import html  # html.escape replaces cgi.escape, which was removed in Python 3.8

tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')

# Remove well-formed tags, fixing mistakes by legitimate users
no_tags = tag_re.sub('', user_input)

# Clean up anything else by escaping
ready_for_web = html.escape(no_tags)

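Regex source: MarkupSafe. Their version handles HTML entities too, while this quick one doesn't.

Why can't I just strip the tags and leave it?

It's one thing to keep people from <i>italicizing</i> things without leaving <i>s floating around. But it's another to take arbitrary input and make it completely harmless. Most of the techniques on this page will leave things like unclosed comments (<!--) and angle brackets that aren't part of tags (blah <<<><blah) intact. The HTMLParser version can even leave complete tags in, if they're inside an unclosed comment.

What if your template is {{ firstname }} {{ lastname }}? firstname = '<a' and lastname = 'href="http://evil.example/">' will be let through by every tag stripper on this page (except @Medeiros!), because they're not complete tags on their own. Stripping out normal HTML tags is not enough.

Django's strip_tags, an improved (see next heading) version of the top answer to this question, gives the following warning:

Absolutely NO guarantee is provided about the resulting string being HTML safe. So NEVER mark safe the result of a strip_tags call without escaping it first, for example with escape().

Follow their advice!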
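A short sketch of the split-tag problem described above, using only the standard library. The render helper is a hypothetical stand-in for a real template engine, and TAG_RE is the simple regex from earlier on this page; neither is Django's or MarkupSafe's actual code.

```python
import html
import re

TAG_RE = re.compile(r'<[^<]+?>')  # the simple regex from earlier on this page

def render(values, clean):
    # Hypothetical stand-in for a "{{ firstname }} {{ lastname }}" template
    return ' '.join(clean(v) for v in values)

values = ['<a', 'href="http://evil.example/">']

# Stripping alone: neither value is a complete tag, so both pass through
stripped_only = render(values, lambda v: TAG_RE.sub('', v))

# Stripping followed by escaping: the markup characters are neutralized
stripped_then_escaped = render(values, lambda v: html.escape(TAG_RE.sub('', v)))

print(stripped_only)
print(stripped_then_escaped)
```

The first result is a working opening tag assembled from two "harmless" values; the second is inert text.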

To strip tags with HTMLParser, you have to run it multiple times.

It's easy to circumvent the top answer to this question.

Look at this string (source and discussion):

<img<!-- --> src=x onerror=alert(1);//><!-- -->

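The first time HTMLParser sees it, it can't tell that the <img...> is a tag. It looks broken, so HTMLParser doesn't get rid of it. It only takes out the <!-- comments -->, leaving you with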

<img src=x onerror=alert(1);//>

This problem was disclosed to the Django project in March 2014. Their old strip_tags was essentially the same as the top answer to this question. Their new version basically runs it in a loop until running it again doesn't change the string:

# _strip_once runs HTMLParser once, pulling out just the text of all the nodes.

def strip_tags(value):
    """Returns the given HTML with all tags stripped."""
    # Note: in typical case this loop executes _strip_once once. Loop condition
    # is redundant, but helps to reduce number of executions of _strip_once.
    while '<' in value and '>' in value:
        new_value = _strip_once(value)
        if len(new_value) >= len(value):
            # _strip_once was not able to detect more tags
            break
        value = new_value
    return value

Of course, none of this is an issue if you always escape the result of strip_tags().

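Update 19 March 2015: There was a bug in Django versions before 1.4.20, 1.6.11, 1.7.7, and 1.8c1. These versions could enter an infinite loop in the strip_tags() function. The fixed version is reproduced above. More details here.

Good things to copy or use

My example code doesn't handle HTML entities; the Django and MarkupSafe packaged versions do.

My example code is pulled from the excellent MarkupSafe library for cross-site scripting prevention. It's convenient and fast (with C speedups to its native Python version). It's included in Google App Engine, and used by Jinja2 (2.7 and up), Mako, Pylons, and more. It works easily with Django templates from Django 1.7.

Django's strip_tags and other HTML utilities from a recent version are good, but I find them less convenient than MarkupSafe. They're pretty self-contained; you could copy what you need from this file.

If you need to strip almost all tags, the Bleach library is good. You can have it enforce rules like "my users can italicize things, but they can't make iframes."

Understand the properties of your tag stripper! Run fuzz tests on it! Here is the code I used to do the research for this answer.

Sheepish note: the question itself is about printing to the console, but this is the top Google result for "python strip HTML from string", so that's why this answer is 99% about the web.

I needed a way to strip tags and decode HTML entities to plain text. The following solution is based on Eloff's answer (which I couldn't use because it strips entities).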

import html.parser

class HTMLTextExtractor(html.parser.HTMLParser):
    def __init__(self):
        super(HTMLTextExtractor, self).__init__()
        self.result = [ ]

    def handle_data(self, d):
        self.result.append(d)

    def get_text(self):
        return ''.join(self.result)

def html_to_text(html):
    """Converts HTML to plain text (stripping tags and converting entities).
    >>> html_to_text('<a href="#">Demo<!--...--> <em>(&not; \u0394&#x03b7;&#956;&#x03CE;)</em></a>')
    'Demo (\xac \u0394\u03b7\u03bc\u03ce)'

    "Plain text" doesn't mean result can safely be used as-is in HTML.
    >>> html_to_text('&lt;script&gt;alert("Hello");&lt;/script&gt;')
    '<script>alert("Hello");</script>'

    Always use html.escape to sanitize text before using in an HTML context!

    HTMLParser will do its best to make sense of invalid HTML.
    >>> html_to_text('x < y &lt z <!--b')
    'x < y < z '

    Named entities are handled as per HTML 5.
    >>> html_to_text('&nosuchentity; &apos; ')
    "&nosuchentity; ' "
    """
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()

A quick test:

html = '<a href="#">Demo <em>(&not; \u0394&#x03b7;&#956;&#x03CE;)</em></a>'
print(repr(html_to_text(html)))

Result:

'Demo (¬ Δημώ)'

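Security note: Do not confuse HTML stripping (converting HTML into plain text) with HTML sanitizing (converting plain text into HTML). This answer removes HTML and decodes entities into plain text; that does not make the result safe to use in an HTML context.

Example: &lt;script&gt;alert("Hello");&lt;/script&gt; will be converted to <script>alert("Hello");</script>, which is 100% correct behavior, but obviously not sufficient if the resulting plain text is inserted as-is into an HTML page.

The rule is not hard: any time you insert a plain-text string into HTML output, always HTML-escape it (using html.escape), even if you "know" that it doesn't contain HTML (e.g. because you stripped the HTML content).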

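However, the OP asked about printing the result to the console, in which case no HTML escaping is needed. Instead, you may want to strip ASCII control characters, as they can trigger unwanted behavior (especially on Unix systems):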

import re
text = html_to_text(untrusted_html_input)
clean_text = re.sub(r'[\0-\x1f\x7f]+', '', text)
# Alternatively, if you want to allow newlines:
# clean_text = re.sub(r'[\0-\x09\x0b-\x1f\x7f]+', '', text)
print(clean_text)

There's a simple way to do this:

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out
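One thing to watch in the function above: any quote character toggles the same flag, so an apostrophe inside a double-quoted attribute (title="it's") ends the quoted state early. Below is a sketch of a variant that remembers which quote opened the attribute value. remove_html_markup_q is a hypothetical name, and this is still a character-level heuristic, not an HTML parser.

```python
def remove_html_markup_q(s):
    in_tag = False
    quote_char = None  # the quote that opened the current attribute value
    out = []

    for c in s:
        if quote_char is not None:
            # Inside a quoted attribute value: only the same quote closes it
            if c == quote_char:
                quote_char = None
        elif in_tag:
            if c == '>':
                in_tag = False
            elif c in ('"', "'"):
                quote_char = c
        elif c == '<':
            in_tag = True
        else:
            out.append(c)

    return ''.join(out)

print(remove_html_markup_q('<a href=">">link</a>'))
print(remove_html_markup_q('<a title="it\'s">x</a>'))
```

Collecting characters in a list and joining once also avoids the repeated string concatenation of the original.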

For the idea, see http://youtu.be/2tu9LTDujbw

You can see it working at http://youtu.be/HPkNPcYed9M?t=35s

PS - If you're interested in the class (about smart debugging with Python), here is a link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome! :)

An lxml.html-based solution (lxml is a native library and can be more performant than a pure Python solution).

To install the lxml module, use pip install lxml

Remove ALL tags

from lxml import html


## from file-like object or URL
tree = html.parse(file_like_object_or_url)

## from string
tree = html.fromstring('safe <script>unsafe</script> safe')

print(tree.text_content().strip())

### OUTPUT: 'safe unsafe safe'

Remove ALL tags with pre-sanitizing HTML (dropping some tags)

from lxml import html
from lxml.html.clean import clean_html

tree = html.fromstring("""<script>dangerous</script><span class="item-summary">
                            Detailed answers to any questions you might have
                        </span>""")

## text only
print(clean_html(tree).text_content().strip())

### OUTPUT: 'Detailed answers to any questions you might have'

Also see http://lxml.de/lxmlhtml.html#cleaning-up-html for what exactly lxml.cleaner does.

If you need more control over which specific tags should be removed before converting to text, create a custom lxml Cleaner with the desired options, e.g.:

from lxml.html.clean import Cleaner

cleaner = Cleaner(page_structure=True,
                  meta=True,
                  embedded=True,
                  links=True,
                  style=True,
                  processing_instructions=True,
                  inline_style=True,
                  scripts=True,
                  javascript=True,
                  comments=True,
                  frames=True,
                  forms=True,
                  annoying_tags=True,
                  remove_unknown_tags=True,
                  safe_attrs_only=True,
                  safe_attrs=frozenset(['src','color', 'href', 'title', 'class', 'name', 'id']),
                  remove_tags=('span', 'font', 'div')
                  )
sanitized_html = cleaner.clean_html(unsafe_html)

To customize how plain text is generated, you can use lxml.etree.tostring() instead of text_content():

from lxml.etree import tostring

print(tostring(tree, method='text', encoding='unicode'))

Here is a simple solution that strips HTML tags and decodes HTML entities, based on the amazingly fast lxml library:

from lxml import html

def strip_html(s):
    return str(html.fromstring(s).text_content())

strip_html('Ein <a href="">sch&ouml;ner</a> Text.')  # Output: Ein schöner Text.

If you need to preserve HTML entities (e.g. &amp;), I added a handle_entityref method to Eloff's answer.

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)
    def get_data(self):
        return ''.join(self.fed)

def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

The easiest way to strip all HTML tags is to use BeautifulSoup:

from bs4 import BeautifulSoup  # Or from BeautifulSoup import BeautifulSoup

def stripHtmlTags(htmlTxt):
    if htmlTxt is None:
        return None
    else:
        return ''.join(BeautifulSoup(htmlTxt).findAll(text=True))

I tried the code of the accepted answer but was getting "RuntimeError: maximum recursion depth exceeded", which didn't happen with the block of code above.

The Beautiful Soup package does this for you directly.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
text = soup.get_text()
print(text)

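Here is a solution similar to the currently accepted answer (https://stackoverflow.com/a/925630/95989), except that it uses the internal HTMLParser class directly (i.e. no subclassing), which makes it significantly more terse:

from html.parser import HTMLParser

def strip_text(text):
    parts = []
    parser = HTMLParser()
    parser.handle_data = parts.append
    parser.feed(text)
    return ''.join(parts)

Here is a solution for Python 3.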

import html
import re

def html_to_txt(html_text):
    ## unescape html
    txt = html.unescape(html_text)
    tags = re.findall("<[^>]+>",txt)
    print("found tags: ")
    print(tags)
    for tag in tags:
        txt=txt.replace(tag,'')
    return txt

I'm not sure it is perfect, but it solved my use case and seems simple.
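One property of the function above that is easy to miss: because it unescapes first, text that was written as escaped markup (for example &lt;b&gt;) turns into real-looking tags and is then stripped as well. The sketch below compares both orders. The function names and the sample string are made up for illustration, and which order is right depends on your use case.

```python
import html
import re

TAG_RE = re.compile(r'<[^>]+>')

def unescape_then_strip(s):
    # Order used by html_to_txt above: escaped markup is removed too
    return TAG_RE.sub('', html.unescape(s))

def strip_then_unescape(s):
    # Alternative order: escaped markup survives as literal text
    return html.unescape(TAG_RE.sub('', s))

sample = 'Use &lt;b&gt; for bold, <i>ok</i>'
print(repr(unescape_then_strip(sample)))
print(repr(strip_then_unescape(sample)))
```

The second result contains a literal <b>, so it still has to be escaped before it goes back into an HTML page.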

I needed to strip HTML for a project, but also the CSS and JS. So I made a variation of Eloff's answer:

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs= True
        self.fed = []
        self.css = False
    def handle_starttag(self, tag, attrs):
        if tag == "style" or tag=="script":
            self.css = True
    def handle_endtag(self, tag):
        if tag=="style" or tag=="script":
            self.css=False
    def handle_data(self, d):
        if not self.css:
            self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

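You can either use a different HTML parser (like lxml or Beautiful Soup), one that offers functions to extract just the text, or you can run a regex on your line string that strips out the tags. See the Python docs for more.

I have used Eloff's answer successfully with Python 3.1 [many thanks!].

I upgraded to Python 3.2.3 and ran into errors.

The solution, provided here thanks to the responder Thomas K, is to insert super().__init__() into the following code: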

def __init__(self):
    self.reset()
    self.fed = []

... in order to make it look like this:

def __init__(self):
    super().__init__()
    self.reset()
    self.fed = []

... and it will work for Python 3.2.3.

Again, thanks to Thomas K for the fix and to Eloff for the original code provided above!

The solutions with HTML-Parser are all breakable if they run only once:

html_to_text('<<b>script>alert("hacked")<</b>/script>')

results in:

<script>alert("hacked")</script>

which is exactly what you intend to prevent. If you use an HTML-Parser, count the tags until zero are replaced:

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
        self.containstags = False

    def handle_starttag(self, tag, attrs):
        self.containstags = True

    def handle_data(self, d):
        self.fed.append(d)

    def has_tags(self):
        return self.containstags

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    must_filtered = True
    while ( must_filtered ):
        s = MLStripper()
        s.feed(html)
        html = s.get_data()
        must_filtered = s.has_tags()
    return html

This is a quick fix and can be optimized further, but it works fine. This code replaces all non-empty tags with "" and strips all HTML tags from the given input text. You can run it using ./file.py inputfile outputfile

#!/usr/bin/python
import sys

def replace(strng, replaceText):
    # Remove every occurrence of replaceText from strng
    rpl = 0
    while rpl > -1:
        rpl = strng.find(replaceText)
        if rpl != -1:
            strng = strng[0:rpl] + strng[rpl + len(replaceText):]
    return strng


lessThanPos = -1
count = 0
listOf = []

try:
    # write file
    writeto = open(sys.argv[2], 'w')

    # read file and store it in list
    f = open(sys.argv[1], 'r')
    for readLine in f.readlines():
        listOf.append(readLine)
    f.close()

    # remove all tags
    for line in listOf:
        count = 0
        lessThanPos = -1
        lineTemp = line

        for char in lineTemp:
            if char == "<":
                lessThanPos = count
            if char == ">":
                if lessThanPos > -1:
                    if line[lessThanPos:count + 1] != '<>':
                        lineTemp = replace(lineTemp, line[lessThanPos:count + 1])
                        lessThanPos = -1
            count = count + 1
        lineTemp = lineTemp.replace("&lt", "<")
        lineTemp = lineTemp.replace("&gt", ">")
        writeto.write(lineTemp)
    writeto.close()
    print("Write To --- >", sys.argv[2])
except Exception:
    print("Help: invalid arguments or exception")
    print("Usage : ", sys.argv[0], " inputfile outputfile")

A Python 3 adaptation of sören-lövborg's answer

from html.parser import HTMLParser
from html.entities import name2codepoint

class HTMLTextExtractor(HTMLParser):
    """ Adaption of http://stackoverflow.com/a/7778368/196732 """
    def __init__(self):
        super().__init__()
        self.result = []

    def handle_data(self, d):
        self.result.append(d)

    # With the default convert_charrefs=True (Python 3.5+), references are
    # decoded before handle_data and the two handlers below are not called.
    def handle_charref(self, number):
        codepoint = int(number[1:], 16) if number[0] in ('x', 'X') else int(number)
        self.result.append(chr(codepoint))

    def handle_entityref(self, name):
        if name in name2codepoint:
            self.result.append(chr(name2codepoint[name]))

    def get_text(self):
        return ''.join(self.result)

def html_to_text(html):
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()

You can write your own function:

def StripTags(text):
     finished = 0
     while not finished:
         finished = 1
         start = text.find("<")
         if start >= 0:
             stop = text[start:].find(">")
             if stop >= 0:
                 text = text[:start] + text[start+stop+1:]
                 finished = 0
     return text

I'm parsing GitHub readmes and I find that the following works really well:

import re
import lxml.html

def strip_markdown(x):
    links_sub = re.sub(r'\[(.+)\]\([^\)]+\)', r'\1', x)
    bold_sub = re.sub(r'\*\*([^*]+)\*\*', r'\1', links_sub)
    emph_sub = re.sub(r'\*([^*]+)\*', r'\1', bold_sub)
    return emph_sub

def strip_html(x):
    return lxml.html.fromstring(x).text_content() if x else ''

And then

readme = """<img src="https://raw.githubusercontent.com/kootenpv/sky/master/resources/skylogo.png" />

            sky is a web scraping framework, implemented with the latest python versions in mind (3.4+). 
            It uses the asynchronous `asyncio` framework, as well as many popular modules 
            and extensions.

            Most importantly, it aims for **next generation** web crawling where machine intelligence 
            is used to speed up the development/maintainance/reliability of crawling.

            It mainly does this by considering the user to be interested in content 
            from *domains*, not just a collection of *single pages*
            ([templating approach](#templating-approach))."""

strip_markdown(strip_html(readme))

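This removes all Markdown and HTML correctly.

With BeautifulSoup, html2text or the code from @Eloff, some HTML elements and JavaScript code are left behind most of the time.

So you can combine these libraries and also delete the Markdown formatting (Python 3):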

import re
import html2text
from bs4 import BeautifulSoup
def html2Text(html):
    def removeMarkdown(text):
        for current in [r"^[ #*]{2,30}", r"^[ ]{0,30}\d\\.", r"^[ ]{0,30}\d\."]:
            markdown = re.compile(current, flags=re.MULTILINE)
            text = markdown.sub(" ", text)
        return text
    def removeAngular(text):
        angular = re.compile(r"[{][|].{2,40}[|][}]|[{][*].{2,40}[*][}]|[{][{].{2,40}[}][}]|\[\[.{2,40}\]\]")
        text = angular.sub(" ", text)
        return text
    h = html2text.HTML2Text()
    h.images_to_alt = True
    h.ignore_links = True
    h.ignore_emphasis = False
    h.skip_internal_links = True
    text = h.handle(html)
    soup = BeautifulSoup(text, "html.parser")
    text = soup.text
    text = removeAngular(text)
    text = removeMarkdown(text)
    return text

It works well for me, but it can be improved, of course.

Simple code! This removes all kinds of tags and the content inside them.

def rm(s):
    start=False
    end=False
    s=' '+s
    for i in range(len(s)-1):
        if i<len(s):
            if start!=False:
                if s[i]=='>':
                    end=i
                    s=s[:start]+s[end+1:]
                    start=end=False
            else:
                if s[i]=='<':
                    start=i
    if s.count('<')>0:
        # Note: input with an unmatched '<' never reaches the base case
        return rm(s)
    else:
        s=s.replace('&nbsp;', ' ')
        return s

But it won't give the full result if the text itself contains <> symbols.

# This is a regex solution.
import re
def removeHtml(html):
  if not html: return html
  # Remove comments first
  innerText = re.compile('<!--[\s\S]*?-->').sub('',html)
  while innerText.find('>')>=0: # Loop through nested Tags
    text = re.compile('<[^<>]+?>').sub('',innerText)
    if text == innerText:
      break
    innerText = text

  return innerText.strip()

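2020 Update

Use the Mozilla Bleach library. It lets you customize which tags and which attributes to keep, and it can also filter out attributes based on their values.

Here are two cases to illustrate.

1) Do not allow any HTML tags or attributes

Take this sample raw text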

raw_text = """
<p><img width="696" height="392" src="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg" class="attachment-medium_large size-medium_large wp-post-image" alt="Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC" style="float:left; margin:0 15px 15px 0;" srcset="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg 768w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-300x169.jpg 300w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1024x576.jpg 1024w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-696x392.jpg 696w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1068x601.jpg 1068w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-747x420.jpg 747w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-190x107.jpg 190w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-380x214.jpg 380w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-760x428.jpg 760w, 
https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc.jpg 1280w" sizes="(max-width: 696px) 100vw, 696px" />Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://news.bitcoin.com/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc/">Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC</a> appeared first on <a rel="nofollow" href="https://news.bitcoin.com">Bitcoin News</a>.</p> 
"""

2) Remove all HTML tags and attributes from the raw text

# DO NOT ALLOW any tags or any attributes
from bleach.sanitizer import Cleaner
cleaner = Cleaner(tags=[], attributes={}, styles=[], protocols=[], strip=True, strip_comments=True, filters=None)
print(cleaner.clean(raw_text))

Output

Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]
The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News.

3) Allow only img tags with the srcset attribute

from bleach.sanitizer import Cleaner
# ALLOW ONLY img tags with srcset attribute
cleaner = Cleaner(tags=['img'], attributes={'img': ['srcset']}, styles=[], protocols=[], strip=True, strip_comments=True, filters=None)
print(cleaner.clean(raw_text))

Output

<img srcset="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg 768w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-300x169.jpg 300w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1024x576.jpg 1024w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-696x392.jpg 696w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1068x601.jpg 1068w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-747x420.jpg 747w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-190x107.jpg 190w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-380x214.jpg 380w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-760x428.jpg 760w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc.jpg 1280w">Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. 
Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]
The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News.

I'm doing it like this, but I have no idea what I'm doing. I take data from an HTML table by stripping out the HTML tags.

This takes the string "name" and returns the string "name1" without HTML tags.

x = 0
anglebrackets = 0
name1 = ""
while x < len(name):
    
    if name[x] == "<":
        anglebrackets = anglebrackets + 1
    if name[x] == ">":
        anglebrackets = anglebrackets - 1
    if anglebrackets == 0:
        if name[x] != ">":
            name1 = name1 + name[x]
    x = x + 1

import re

def remove(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

This method works flawlessly for me and requires no additional installations:

import re
from html.entities import entitydefs

def convertentity(m):
    if m.group(1) == '#':
        # Numeric character reference, e.g. &#169;
        try:
            return chr(int(m.group(2)))
        except ValueError:
            return '&#%s;' % m.group(2)
    # Named entity, e.g. &amp;
    try:
        return entitydefs[m.group(2)]
    except KeyError:
        return '&%s;' % m.group(2)

def converthtml(s):
    return re.sub(r'&(#?)(.+?);', convertentity, s)

html = converthtml(html)
html = html.replace("&nbsp;", " ")  ## Get rid of the remnants of certain formatting (subscript, superscript, etc).
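On Python 3.4 and later, the standard library's html.unescape covers the same ground as a hand-written entity converter, including hexadecimal references. A minimal sketch follows; the sample strings are made up for illustration.

```python
import html

raw = 'caf&eacute; &amp; &#169; &#x3b7; &nosuchentity;'
decoded = html.unescape(raw)  # unknown entities are left untouched
print(decoded)

# &nbsp; decodes to a non-breaking space (U+00A0), not an ASCII space
spaced = html.unescape('a&nbsp;b').replace('\xa0', ' ')
print(repr(spaced))
```

Note that decoding entities is separate from stripping tags: the decoded text can contain < and > and still needs escaping before it is placed back into HTML.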

Source URL: https://stackoverflow.com/questions/753052/strip-html-from-strings-in-python
