(PYTHON) Day - 23 Regex and Parsing(3)

Posted 2020-03-14 in LANGUAGE 🚀, PYTHON, HACKERRANK

Reference

Problem source - HackerRank
Python practice - Practice - Python

Some of what is written here reflects my own thoughts and guesses.
After solving each problem I read the Discussion tab and try to improve my coding style.

HackerRank

HackerRank's Python practice problems are grouped into the following categories.

Subdomain

- ~~Introduction~~
- ~~Basic Data Types~~
- ~~Strings~~
- ~~Sets~~
- ~~Math~~
- ~~Itertools~~
- ~~Collections~~
- ~~Date and Time~~
- ~~Errors and Exceptions~~
- ~~Classes~~
- ~~Built-Ins~~
- ~~Python Functionals~~
- <strong style="color:blue">Regex and Parsing</strong>
- XML
- Closures and Decorators
- Numpy
- Debugging

Regex and Parsing

Problem

Hex Color Code
HTML Parser - Part 1
HTML Parser - Part 2
Detect HTML Tags, Attributes and Attribute Values
Validating UID
Validating Credit Card Numbers
Validating Postal Codes
Matrix Script

Hex Color Code

Problem: find the HEX color codes in CSS code
Input: the number of code lines N, followed by N lines of CSS code
Output: the HEX color codes

input

11
#BED
{
color: #FfFdF8; background-color:#aef;
font-size: 123px;
background: -webkit-linear-gradient(top, #f9f9f9, #fff);
}
#Cab
{
background-color: #ABC;
border: 2px dashed #fff;
}

output

#FfFdF8
#aef
#f9f9f9
#fff
#ABC
#fff

import re

for _ in range(int(input())):
    # The leading "." skips selectors such as "#BED" at the start of a line
    HEX = re.findall(r".(#[0-9A-Fa-f]{6}|#[0-9A-Fa-f]{3})", input())
    if HEX:
        print(*HEX, sep='\n')
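As a rough sketch, the same pattern can be wrapped in a function so that it can be checked without reading from standard input. The name `find_hex_colors` is only illustrative and is not part of the original solution.

```python
import re

# Same pattern as above: the leading "." requires some character before "#",
# so a selector such as "#BED" at the very start of a line is skipped.
HEX_PATTERN = re.compile(r".(#[0-9A-Fa-f]{6}|#[0-9A-Fa-f]{3})")


def find_hex_colors(line):
    """Return the hex color codes found in one line of CSS (hypothetical helper)."""
    return HEX_PATTERN.findall(line)


print(find_hex_colors("color: #FfFdF8; background-color:#aef;"))
print(find_hex_colors("#BED"))
```

The 6-digit alternative is listed first so that `#f9f9f9` is not cut short to `#f9f`.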

HTML Parser - Part 1

Problem: split an HTML document into its individual tags

input

2

  <html><head><title>HTML Parser - I</title></head>
  <body data-modal-target class='1'><h1>HackerRank</h1><br /></body></html>

output

Start : html
Start : head
Start : title
End   : title
End   : head
Start : body
-> data-modal-target > None
-> class > 1
Start : h1
End   : h1
Empty : br
End   : body
End   : html

The end-tag line must have three spaces after "End" so that the colons line up…

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start :", tag)
        for e in attrs:
            print('->', e[0], '>', e[1])

    def handle_endtag(self, tag):
        print("End   :", tag)

    def handle_startendtag(self, tag, attrs):
        print("Empty :", tag)
        for e in attrs:
            print('->', e[0], '>', e[1])

parser = MyHTMLParser()
parser.feed(''.join([input() for _ in range(int(input()))]))
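To see which handler fires for which tag, here is a minimal sketch that records the events in a list instead of printing them. The class name `EventCollector` and its `events` attribute are made up for illustration; only the three handler methods come from `html.parser.HTMLParser`.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical variant that records events in a list instead of printing."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("Start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("End", tag, []))

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing tags such as <br />.
        self.events.append(("Empty", tag, attrs))


collector = EventCollector()
collector.feed("<body data-modal-target class='1'><h1>HackerRank</h1><br /></body>")
for kind, tag, attrs in collector.events:
    print(kind, tag, attrs)
```

An attribute without a value, such as `data-modal-target`, arrives as a `(name, None)` pair, which is where the `None` in the expected output comes from.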

HTML Parser - Part 2

Problem: tell comments and data apart

input

4

  <!--[if IE 9]>IE9-specific content
  <![endif]-->
  <div> Welcome to HackerRank</div>
  <!--[if IE 9]>IE9-specific content<![endif]-->

output


>>> Multi-line Comment
[if IE 9]>IE9-specific content
<![endif]
>>> Data
 Welcome to HackerRank
>>> Single-line Comment
[if IE 9]>IE9-specific content<![endif]

The task is to define handle_comment and handle_data

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_comment(self, data):
        if '\n' in data:
            print('>>> Multi-line Comment')
            print(data)
        else:
            print('>>> Single-line Comment')
            print(data)

    def handle_data(self, data):
        if data != '\n':
            print('>>> Data')
            print(data)

html = ""
for i in range(int(input())):
    html += input().rstrip()
    html += '\n'

parser = MyHTMLParser()
parser.feed(html)
parser.close()

Detect HTML Tags, Attributes and Attribute Values

Problem: pick out the tags, attributes, and attribute values

input

9

  <head>
  <title>HTML</title>
  </head>
  <object type="application/x-flash"
    data="your-file.swf"
    width="0" height="0">
    <!-- <param name="movie" value="your-file.swf" /> -->
    <param name="quality" value="high"/>
  </object>

output

head
title
object
-> type > application/x-flash
-> data > your-file.swf
-> width > 0
-> height > 0
param
-> name > quality
-> value > high

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(tag)
        for attr in attrs:
            print('-> {} > {}'.format(*attr))

html = '\n'.join([input() for _ in range(int(input()))])
parser = MyHTMLParser()
parser.feed(html)
parser.close()

An answer that uses only regular expressions

import re

text = ''
for _ in range(int(input())):
    # Blank out comments before looking for tags
    text = re.sub(r'<!.+-->', r' ', (text + input()))

for er in re.findall(r'<([^/][^>]*)>', text):
    if ' ' in er:
        for ere in re.findall(r'([a-z]+)? *([a-z-]+)="([^"]+)', er):
            if ere[0]:
                print(ere[0])
            print('-> ' + ere[1] + ' > ' + ere[2])
    else:
        print(er)

Validating UID

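Problem: decide whether a unique identification number (UID) is valid
Conditions: it consists only of alphanumeric characters (a-z, A-Z, 0-9), contains at least 2 uppercase letters and at least 3 digits, has no repeated character, and is exactly 10 characters long
Example: in B1CD102354 the character 1 is repeated -> Invalid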

input

2
B1CD102354
B1CDEF2354

output

Invalid
Valid

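This is very clean and nice. Get comfortable with using all().
In particular, remember the regular expression that checks for repeated characters.
(?!pattern): negative lookahead assertion. The characters that follow must not match 'pattern'.
Backreference \N: for example, the expression ['"][^'"]*['"] matches text enclosed in single or double quotes. To force the opening and closing quotes to be the same, write (['"])[^'"]*\1 instead. Here \1 matches whatever the first parenthesized group, (['"]), matched.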

import re

no_repeats = r"(?!.*(.).*\1)"
two_or_more_upper = r"(.*[A-Z]){2,}"
three_or_more_digits = r"(.*\d){3,}"
ten_alphanumerics = r"[a-zA-Z0-9]{10}$"  # "$" rejects strings longer than 10
filters = no_repeats, two_or_more_upper, three_or_more_digits, ten_alphanumerics

for uid in [input() for _ in range(int(input()))]:
    if all(re.match(f, uid) for f in filters):
        print("Valid")
    else:
        print("Invalid")
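The backreference and negative lookahead notes above can be checked with a small sketch. The variable name `same_quote` and the test strings are my own; `no_repeats` is the same pattern as in the solution.

```python
import re

# Backreference: \1 must repeat whatever quote character group 1 captured.
same_quote = re.compile(r"""(['"])[^'"]*\1""")
print(bool(same_quote.fullmatch("'abc'")))   # opening and closing quotes agree
print(bool(same_quote.fullmatch("'abc\"")))  # quotes differ

# Negative lookahead: fail if any character appears again later in the string.
no_repeats = re.compile(r"(?!.*(.).*\1)")
print(bool(no_repeats.match("B1CDEF2354")))
print(bool(no_repeats.match("B1CD102354")))
```

Because the lookahead consumes nothing, `no_repeats` matches an empty string at position 0 when it succeeds, which is why it can be combined with the other filters through `all()`.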

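I had never seen assert statements before, so I should look at them more closely later. Sorting the string first and then checking the conditions was an impressive idea.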

import re

for _ in range(int(input())):
    # After sorting, digits, uppercase and lowercase letters are grouped together
    u = ''.join(sorted(input()))
    try:
        assert re.search(r'[A-Z]{2}', u)
        assert re.search(r'\d\d\d', u)
        assert not re.search(r'[^a-zA-Z0-9]', u)
        assert not re.search(r'(.)\1', u)
        assert len(u) == 10
    except AssertionError:
        print('Invalid')
    else:
        print('Valid')

Validating Credit Card Numbers

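Problem: check whether a credit card number is valid
Conditions:
It must start with 4, 5, or 6
It must contain exactly 16 digits
It must consist only of digits (0-9)
The digits may be split into groups of 4 by a single hyphen (-)
It must not use any other separator such as ' ' or '_'
It must not contain 4 or more consecutive repeated digits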

input

6
4123456789123456
5123-4567-8912-3456
61234-567-8912-3456
4123356789123456
5133-3367-8912-3456
5123 - 3567 - 8912 - 3456

output

Valid # 4123456789123456
Valid # 5123-4567-8912-3456
Invalid # 61234-567-8912-3456 <- the group 567 has only 3 digits
Valid # 4123356789123456
Invalid # 5133-3367-8912-3456 <- 33-33 is four consecutive 3s
Invalid # 5123 - 3567 - 8912 - 3456 <- spaces are used as separators

import re

def validate_credit_cards(credit_cards):
    valid_structure = r"[456]\d{3}(-?\d{4}){3}$"
    no_four_repeats = r"((\d)-?(?!(-?\2){3})){16}"
    filters = valid_structure, no_four_repeats

    if all(re.match(f, credit_cards) for f in filters):
        print("Valid")
    else:
        print("Invalid")

for _ in range(int(input())):
    credit = input()
    validate_credit_cards(credit)

I learned that a regular expression can also be written across several lines

import re
pattern = re.compile(r"^"
                     r"(?!.*(\d)(-?\1){3})"
                     r"[456]"
                     r"\d{3}"
                     r"(?:-?\d{4}){3}"
                     r"$")
for _ in range(int(input().strip())):
    print("Valid" if pattern.search(input().strip()) else "Invalid")
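Another way to spread a pattern over several lines is the `re.VERBOSE` flag. This is a different technique from the adjacent string literals used above, shown here only as a sketch; the helper name `is_valid_card` is hypothetical.

```python
import re

# Assumption: re.VERBOSE is used instead of adjacent string literals.
# Whitespace and "#" comments inside the pattern are ignored by the engine.
CARD = re.compile(r"""
    ^
    (?!.*(\d)(-?\1){3})   # no digit repeated four times in a row, hyphens ignored
    [456]                 # first digit is 4, 5 or 6
    \d{3}                 # rest of the first group
    (?:-?\d{4}){3}        # three more groups of 4, each after an optional hyphen
    $
""", re.VERBOSE)


def is_valid_card(number):
    """Hypothetical helper returning True for a valid card number."""
    return CARD.search(number) is not None


for number in ["4123456789123456", "5133-3367-8912-3456"]:
    print(number, is_valid_card(number))
```

The upside is that each part of the pattern can carry its own comment; the catch is that a literal space or `#` would then need escaping.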

Validating Postal Codes

Problem: check whether a postal code is valid
Conditions:
It must be a number in the range 100000 - 999999
It must not contain more than one alternating repetitive digit pair
(an alternating repetitive digit pair is two equal digits with exactly one digit between them)
Examples:
523563 # no alternating repetitive digit pair here
552523 # here both 2 and 5 are alternating repetitive digits

input

110000

output

False

The start and end of the regular expression must be stated explicitly! Without ^ and $, a number longer than 6 digits is also treated as valid

regex_integer_in_range = r"^[1-9]\d{5}$"
regex_alternating_repetitive_digit_pair = r"(\d)(?=\d\1)"

import re
P = input()

print(bool(re.match(regex_integer_in_range, P))
      and len(re.findall(regex_alternating_repetitive_digit_pair, P)) < 2)
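A short sketch of why the anchors matter, and of how the lookahead counts overlapping pairs. The names `anchored` and `unanchored` are only for this illustration.

```python
import re

anchored = r"^[1-9]\d{5}$"
unanchored = r"[1-9]\d{5}"

# re.match only anchors at the start, so the unanchored pattern accepts
# a 7-digit string by matching its first six digits.
print(bool(re.match(unanchored, "1234567")))
print(bool(re.match(anchored, "1234567")))

# The lookahead (?=\d\1) does not consume characters, so overlapping
# alternating pairs are all counted.
pairs = re.findall(r"(\d)(?=\d\1)", "552523")
print(pairs)
```

If the digit after the captured one were consumed instead of looked ahead, the pair starting at that digit would be missed.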

Matrix Script

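Problem: decode a message that is given as a matrix
How to decode: read the matrix column by column (top to bottom, left to right) and replace any symbols or spaces between two alphanumeric characters with a single space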

input

7 3
Tsi
h%x
i #
sM
$a
#t%
ir!

output

This is Matrix# %!

Any run of special characters between two alphanumeric characters is replaced with ' '

import re

regex = re.compile(r'(?<=\w)(\W+)(?=\w)')
N, M = (int(num) for num in input().split())
encoded = [input() for _ in range(N)]
decoded = ''.join(row[letter] for letter in range(M) for row in encoded)
print(regex.sub(' ', decoded))
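The column-wise read can also be written with `zip`, sketched below as a function so that it can be tried without standard input. `decode` is a hypothetical name, and the sample rows are padded with trailing spaces to M = 3 characters.

```python
import re


def decode(rows):
    """Read the matrix column by column, then collapse symbol runs between
    alphanumeric characters into one space (hypothetical helper)."""
    columns = ''.join(''.join(col) for col in zip(*rows))
    return re.sub(r'(?<=\w)(\W+)(?=\w)', ' ', columns)


# Every row is exactly 3 characters long; two of them end in a space.
sample = ["Tsi", "h%x", "i #", "sM ", "$a ", "#t%", "ir!"]
print(decode(sample))
```

The symbols at the end are left alone because no alphanumeric character follows them, so the lookahead `(?=\w)` fails there.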

(PYTHON) Day - 22 Regex and Parsing(2)

Posted 2020-03-13 in LANGUAGE 🚀, PYTHON, HACKERRANK

Regex and Parsing

Detect Floating Point Number

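Problem: check whether each input matches the required format
Input: the number of inputs N, followed by N strings
Output: True/False for each string, depending on whether it matches the format
Format:

The number can start with +, -, or .
The number must have at least one digit after the decimal point (that is, it is not an integer)
Converting it with float() must not raise an error

input

4
4.0O0
-1.00
+4.54
SomeRandomStuff

output

False
True
True
False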

Memorize the regular expression symbols well

'''
^: marks the start
[+-]?: either `+` or `-` may appear once, or not at all
\d*: zero or more digits <- \d can also be written as [0-9]
\.: a literal `.` character
\d+: at least one digit
$: marks the end
'''

import re

for _ in range(int(input())):
    n = input()
    print(bool(re.search(r'^[+-]?\d*\.\d+$', n)))
    # print(bool(re.search(r'^[+-]?[0-9]*\.[0-9]+$', n)))
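To try the pattern on a few strings without standard input, it can be wrapped as below. This is only a sketch; `looks_like_float` is a made-up name, and the last two candidates are extra cases I added.

```python
import re

FLOAT_PATTERN = re.compile(r'^[+-]?\d*\.\d+$')


def looks_like_float(text):
    """Hypothetical helper: True when text matches the pattern above."""
    return bool(FLOAT_PATTERN.search(text))


for candidate in ["4.0O0", "-1.00", "+4.54", "SomeRandomStuff", ".5", "12."]:
    print(candidate, looks_like_float(candidate))
```

`.5` passes because `\d*` allows zero digits before the dot, while `12.` fails because `\d+` needs at least one digit after it.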

Re.split()

Problem: write a regular expression that splits on commas and dots

input

100,000,000.000

output

100
000
000
000

Couldn't this be done with built-in string methods alone?

print(*input().replace(',', '.').split('.'), sep='\n')

Either way, with a regular expression it looks like this

regex_pattern = r'[\.\,]'

import re
print("\n".join(re.split(regex_pattern, input())))

Group(), Groups() & Groupdict()

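Problem: print the first alphanumeric character that repeats consecutively
Input: a string
Output: the first repeated alphanumeric character (print -1 if there is none)
Example: . repeats but is not alphanumeric; 111 repeats, so the answer is 1

input

..12345678910111213141516171820212223

output

1

Three equivalent patterns are possible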

import re

# S = re.search(r"([A-Za-z0-9])\1+", input())
# S = re.search(r"(\w(?!_))\1+", input())
S = re.search(r"([^\W_])\1+", input())

# print(S[1] if S else -1)
print(S.group(1) if S else -1)
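A small sketch to check that `[^\W_]` really skips non-alphanumeric repeats such as `..` and `__`. The function name `first_repeat` is hypothetical.

```python
import re


def first_repeat(text):
    """Return the first alphanumeric character that repeats consecutively,
    or -1 when there is none (hypothetical helper)."""
    # [^\W_] means "a word character that is not an underscore".
    found = re.search(r"([^\W_])\1+", text)
    return found.group(1) if found else -1


print(first_repeat("..12345678910111213141516171820212223"))
print(first_repeat("__..--"))
```

`[^\W_]` is a double negative: `\W` is "not a word character", so negating it and also excluding `_` leaves letters and digits.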

Re.findall() & Re.finditer()

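Problem: find the substrings that consist of 2 or more vowels and lie between two consonants
Input: a string that may contain spaces, +, and -
Output: the substrings that satisfy the condition (-1 if there are none)

input

rabcdeefgyYhFjkIoomnpOeorteeeeet

output

ee
Ioo
Oeo
eeeee

Model answer
Get familiar with lookaround assertions (such as (?<=...))
With the re.IGNORECASE flag the pattern does not have to list upper and lower case separately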

import re

VOWELS = 'aeiou'
CONSONANTS = 'bcdfghjklmnpqrstvwxyz'
REGEX = '(?<=[' + CONSONANTS + '])([' + VOWELS + ']{2,})[' + CONSONANTS + ']'

match = re.findall(REGEX, input(), re.IGNORECASE)
if match:
    print(*match, sep='\n')
else:
    print('-1')

Re.start() & Re.end()

Problem: print the positions (index ranges) of a substring
Input: a string S; the string k to look for
Output: the positions of k in S

input

aaadaa
aa

output

(0, 1)
(1, 2)
(4, 5)

This feels like a somewhat inefficient approach

import re

S = input()
k = input()
pattern = re.compile(k)
searched = pattern.search(S)

if searched is None:
    print((-1, -1))
else:
    for i in range(len(S)):
        if pattern.match(S[i:]):
            print((i, i + len(k) - 1))

A slightly tidier version

import re

S = input()
k = input()

pattern = re.compile(k)
r = pattern.search(S)

if not r:
    print((-1, -1))
while r:
    print("({0}, {1})".format(r.start(), r.end() - 1))
    # Restart one character after the previous start to catch overlaps
    r = pattern.search(S, r.start() + 1)
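For comparison, overlapping occurrences can also be collected in one pass with a zero-width lookahead and `finditer`. This is an alternative technique, not the approach used above, and `find_all_spans` is a hypothetical name.

```python
import re


def find_all_spans(text, needle):
    """Return inclusive (start, end) index pairs for every occurrence of
    needle in text, including overlapping ones (hypothetical helper)."""
    # A zero-width lookahead lets finditer advance one character at a time,
    # so overlapping occurrences are not skipped. re.escape treats needle
    # as a literal string, which is an assumption about the input.
    pattern = re.compile('(?=(' + re.escape(needle) + '))')
    spans = [(m.start(1), m.end(1) - 1) for m in pattern.finditer(text)]
    return spans or [(-1, -1)]


print(find_all_spans("aaadaa", "aa"))
```

The match itself is empty, so the positions are read from group 1 rather than from the whole match.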

Regex Substitution

Problem: replace symbols with words (&& → and, || → or)
Input: the number of lines N, followed by N lines of text
Output: the converted text

input

11
a = 1;
b = input();

if a + b > 0 && a - b < 0:
start()
elif a*b > 10 || a/b < 1:
stop()
print set(list(a)) | set(list(b))
#Note do not change &&& or ||| or & or |
#Only change those '&&' which have space on both sides.
#Only change those '|| which have space on both sides.

output

a = 1;
b = input();

if a + b > 0 and a - b < 0:
start()
elif a*b > 10 or a/b < 1:
stop()
print set(list(a)) | set(list(b))
#Note do not change &&& or ||| or & or |
#Only change those '&&' which have space on both sides.
#Only change those '|| which have space on both sides.

This chapter recommends solving the problems with regular expressions, but this one does not really seem to need them

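for _ in range(int(input())):
    line = input()

    # Repeat until no operator is left, since " && && " needs two passes
    while ' && ' in line or ' || ' in line:
        line = line.replace(" && ", " and ").replace(" || ", " or ")

    print(line)

The spaces before and after the operators need care… the approach above seems cleaner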

import re

for _ in range(int(input())):
    S = input()
    S = re.sub(r' &&(?= )', ' and', S)
    S = re.sub(r' \|\|(?= )', ' or', S)
    print(S)
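The reason the regex version needs no loop can be seen in a small sketch. The helper name `replace_logical_ops` and the test strings are my own.

```python
import re


def replace_logical_ops(line):
    """Replace ' && ' with ' and ' and ' || ' with ' or ' (hypothetical helper)."""
    # The lookahead (?= ) checks for the trailing space without consuming it,
    # so the same space can serve as the leading space of the next operator.
    line = re.sub(r' &&(?= )', ' and', line)
    return re.sub(r' \|\|(?= )', ' or', line)


print(replace_logical_ops("x && && y || z"))
print(replace_logical_ops("a &&& b | c"))
```

With `str.replace(" && ", ...)` the shared space is consumed by the first replacement, which is why the non-regex version has to loop.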

Validating Roman Numerals

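Problem: check whether the input is a valid Roman numeral (less than 4000)
Input: a Roman numeral
Output: True/False

input

CDXXI

output

True

I learned for the first time that a module for Roman numerals exists (roman is a third-party package, not part of the standard library)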

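from roman import fromRoman

try:
    if 0 < fromRoman(input()) < 4000:
        print(True)
    else:
        print(False)
except Exception:
    # fromRoman raises an exception for an invalid numeral
    print(False)

So rf'' can be written like this too… I just tried it and was surprised that it raised no error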

thousand = 'M{0,3}'
hundred = '(C[MD]|D?C{0,3})'
ten = '(X[CL]|L?X{0,3})'
digit = '(I[VX]|V?I{0,3})'

regex_pattern = rf"{thousand}{hundred}{ten}{digit}$"

import re
print(str(bool(re.match(regex_pattern, input()))))
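A quick sketch for trying the pattern on a few numerals. `is_roman` is a hypothetical wrapper, and the test numerals are my own picks.

```python
import re

thousand = 'M{0,3}'
hundred = '(C[MD]|D?C{0,3})'
ten = '(X[CL]|L?X{0,3})'
digit = '(I[VX]|V?I{0,3})'
regex_pattern = rf"{thousand}{hundred}{ten}{digit}$"


def is_roman(text):
    """Hypothetical wrapper around the pattern above."""
    return bool(re.match(regex_pattern, text))


for numeral in ["CDXXI", "MMMCMXCIX", "MMMM", "IIII", "VX"]:
    print(numeral, is_roman(numeral))

# Inside an f-string, literal braces must be doubled, so a quantifier
# written directly in the rf-string would look like this:
print(rf"M{{0,3}}")
```

Keeping the `{0,3}` quantifiers in plain strings and only interpolating the variable names avoids the doubled braces.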

Validating phone numbers

Problem: identify phone numbers that start with 7, 8, or 9 and have exactly 10 digits
Input: the number of phone numbers N, followed by N strings
Output: YES/NO

input

2
9587456281
1252478965

output

YES
NO

import re

for _ in range(int(input())):
    line = input()
    if re.match(r"^[789]{1}\d{9}$", line):
        print("YES")
    else:
        print("NO")

Validating and Parsing Email Addresses

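Problem: print only the entries whose email address follows the required format
Input: the number of entries n, followed by n lines of name and email address
Output: the valid entries in the form name <user@email.com>

input

2
DEXTER <dexter@hotmail.com>
VIRUS <virus!@variable.:p>

output

DEXTER <dexter@hotmail.com>

Using only a regular expression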

import re
pattern = r'(?<=<)[a-z][a-z0-9._-]*@[a-z]+\.[a-z]{1,3}(?=>)'
for _ in range(int(input())):
    s = input()
    if re.search(pattern, s, re.IGNORECASE):
        print(s)

Using the email.utils module

import re
import email.utils
pattern = r'[a-z][a-z0-9._-]*@[a-z]+\.[a-z]{1,3}$'
for _ in range(int(input())):
    tup = email.utils.parseaddr(input())
    if re.match(pattern, tup[1], re.IGNORECASE):
        print(email.utils.formataddr(tup))
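The `email.utils` version can be sketched as a function that takes a list of lines, which makes it easy to try on the sample input. The name `valid_entries` is hypothetical.

```python
import re
import email.utils

PATTERN = re.compile(r'[a-z][a-z0-9._-]*@[a-z]+\.[a-z]{1,3}$', re.IGNORECASE)


def valid_entries(lines):
    """Keep only the lines whose address part matches the pattern
    (hypothetical helper)."""
    kept = []
    for line in lines:
        # parseaddr splits "NAME <address>" into a (name, address) tuple.
        name, address = email.utils.parseaddr(line)
        if PATTERN.match(address):
            kept.append(email.utils.formataddr((name, address)))
    return kept


print(valid_entries(["DEXTER <dexter@hotmail.com>", "VIRUS <virus!@variable.:p>"]))
```

Since `parseaddr` already strips the angle brackets, the pattern no longer needs the `(?<=<)` and `(?=>)` lookarounds.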

(PYTHON) Day - 22 Regex and Parsing(1)

Posted 2020-03-13 in LANGUAGE 🚀, PYTHON, HACKERRANK

Regex and Parsing

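Basic concepts

Getting Started with Regular Expressions - 점프 투 파이썬 (Jump to Python)
Python Regular Expressions - Google for Education

Metacharacters to watch out for
Note that the metacharacter ^ means different things inside and outside a [] character class!

[^0-9] # matches only non-digit characters
[^abc] # matches any character except a, b, and c
^a # matches a string that starts with a
'''
a # match
aaa # match
baaa # no match
'''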
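The two meanings of ^ can be checked with a few lines; the test strings here are my own.

```python
import re

# Inside [], a leading ^ negates the character class.
print(re.findall(r"[^0-9]", "a1b2"))

# Outside [], ^ anchors the pattern to the start of the string.
print(bool(re.search(r"^a", "aaa")))
print(bool(re.search(r"^a", "baaa")))
```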

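Commonly used character classes

The equivalences below hold for ASCII text; for str patterns Python 3 also matches Unicode digits and letters unless re.ASCII is set.

\d - matches a digit; equivalent to [0-9].
\D - matches a non-digit; equivalent to [^0-9].
\s - matches a whitespace character; equivalent to [ \t\n\r\f\v]. The blank at the front of the class is a space character.
\S - matches a non-whitespace character; equivalent to [^ \t\n\r\f\v].
\w - matches an alphanumeric character or underscore; equivalent to [a-zA-Z0-9_].
\W - matches anything that is not alphanumeric or underscore; equivalent to [^a-zA-Z0-9_].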

Pattern matching

search()
Finds the first position where the pattern occurs

>>> import re
>>> print(bool(re.search(r"ly","similarly")))
True
>>> print(re.search(r"ly","similarly"))
<re.Match object; span=(7, 9), match='ly'>

match()
Checks whether the beginning of the string matches the pattern

>>> import re
>>> print(bool(re.match(r"ly","similarly")))
False
>>> print(re.match(r"ly","similarly"))
None
>>> print(bool(re.match(r"ly","ly should be in the beginning")))
True
>>> print(re.match(r"ly","ly should be in the beginning"))
<re.Match object; span=(0, 2), match='ly'>

Splitting (split)

split()
An advantage of the re module's split() function is that it can split on several delimiters at once

>>> import re
>>> re.split(r"-","+91-011-2711-1111")
['+91', '011', '2711', '1111']
>>> re.split(r"[@\.]", "username@hackerrank.com")
['username', 'hackerrank', 'com']

Groups (group)

Hmm.. this looks useful for organizing results into a dict

group()

>>> import re
>>> m = re.match(r'(\w+)@(\w+)\.(\w+)','username@hackerrank.com')
>>> m.group(0) # The entire match
'username@hackerrank.com'
>>> m.group(1) # The first parenthesized subgroup.
'username'
>>> m.group(2) # The second parenthesized subgroup.
'hackerrank'
>>> m.group(3) # The third parenthesized subgroup.
'com'
>>> m.group(1,2,3) # Multiple arguments give us a tuple.
('username', 'hackerrank', 'com')

groups()

>>> import re
>>> m = re.match(r'(\w+)@(\w+)\.(\w+)','username@hackerrank.com')
>>> m.groups()
('username', 'hackerrank', 'com')

groupdict()

>>> m = re.match(r'(?P<user>\w+)@(?P<website>\w+)\.(?P<extension>\w+)','myname@hackerrank.com')
>>> m.groupdict()
{'user': 'myname', 'website': 'hackerrank', 'extension': 'com'}
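As a sketch of the "organize into a dict" idea, named groups can turn several lines into a list of dicts in one go. The input lines and the names `ADDRESS`, `lines` and `records` are made up for this example.

```python
import re

ADDRESS = re.compile(r'(?P<user>\w+)@(?P<website>\w+)\.(?P<extension>\w+)')

# Hypothetical input lines; only the first one looks like an address.
lines = ["username@hackerrank.com", "not an address"]

# groupdict() turns each successful match into a dict keyed by group name.
records = [m.groupdict() for m in map(ADDRESS.match, lines) if m]
print(records)
```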

Searching (find)

findall()

>>> import re
>>> re.findall(r'\w','http://www.hackerrank.com/')
['h', 't', 't', 'p', 'w', 'w', 'w', 'h', 'a', 'c', 'k', 'e', 'r', 'r', 'a', 'n', 'k', 'c', 'o', 'm']

finditer()

>>> import re
>>> re.finditer(r'\w','http://www.hackerrank.com/')
<callable_iterator object at 0x0266C790>
>>> list(map(lambda x: x.group(),re.finditer(r'\w','http://www.hackerrank.com/')))
['h', 't', 't', 'p', 'w', 'w', 'w', 'h', 'a', 'c', 'k', 'e', 'r', 'r', 'a', 'n', 'k', 'c', 'o', 'm']

Substitution

sub()
A simple basic example

import re

# Squaring numbers
def square(match):
    number = int(match.group(0))
    return str(number**2)

print(re.sub(r"\d+", square, "1 2 3 4 5 6 7 8 9"))

# output: 1 4 9 16 25 36 49 64 81

It is commonly used to strip markup out of an HTML document, such as the comment in this example

import re

html = """

<head>
<title>HTML</title>
</head>
<object type="application/x-flash"
  data="your-file.swf"
  width="0" height="0">
  <!-- <param name="movie"  value="your-file.swf" /> -->
  <param name="quality" value="high"/>
</object>
"""

print(re.sub("(<!--.*?-->)", "", html))  # remove comment

output


<head>
<title>HTML</title>
</head>
<object type="application/x-flash"
  data="your-file.swf"
  width="0" height="0">

  <param name="quality" value="high"/>
</object>

NUNU

I use this blog to keep notes on what I study on my own.
