_yield_len_iter

+흑미+ 2017. 8. 29. 18:47

# _yield_len_iter

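# Tasks such as tokenizing can process a text file line by line,

# so the whole file never has to be loaded into memory at once.

# To bring the text into memory one line at a time, generator-style, implement the __iter__ method.

# DoublespaceLineCorpus already implements __iter__.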
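Before looking at the class, the line-by-line idea can be sketched with a plain generator function. This is only an illustration: `iter_lines` and the temporary sample file are hypothetical names made up for the sketch and are not part of the original post.

```python
import os
import tempfile


def iter_lines(path):
    """Yield the lines of a UTF-8 text file one at a time (hypothetical helper)."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            # Only the current line is held in memory at this point.
            yield line.rstrip('\n')


# Build a small throwaway file so the sketch runs on its own.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('first line\nsecond line\nthird line\n')
    sample_path = tmp.name

lines = iter_lines(sample_path)  # nothing has been read yet
first = next(lines)              # pulls only the first line
rest = list(lines)               # drains the remaining lines and closes the file

os.remove(sample_path)
```

Calling `iter_lines` does no work by itself; each `next()` (or each step of a for loop) reads exactly one more line.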

class DoublespaceLineCorpus:
    def __init__(self, corpus_fname, iter_sent=False):
        self.corpus_fname = corpus_fname
        self.iter_sent = iter_sent

    def __iter__(self):
        with open(self.corpus_fname, encoding='utf-8') as f:
            for doc_idx, doc in enumerate(f):
                # Document mode: yield each line as-is (trailing newline included).
                if not self.iter_sent:
                    yield doc
                    continue
                # Sentence mode: sentences within a line are separated by a double space.
                for sent in doc.split('  '):
                    sent = sent.strip()
                    if not sent:
                        continue
                    yield sent

# For each sent in doc that is not an empty string,
# `yield sent` hands that one string to the "for ... sent in ... corpus" loop below,
# where it is bound to sent and held in memory.

corpus = DoublespaceLineCorpus('./day0_3.txt', iter_sent=True)

for n_sent, sent in enumerate(corpus):
    print('{} sent: {}. its length is {}'.format(n_sent, sent, len(sent)))

# In the example above, only the sent bound by the for loop is created in memory.
# When the loop moves on to the next iteration, the previous sent is no longer kept in memory.

0 sent: 테스트 데이터 입니다. its length is 11
1 sent: 이것만 한 줄에 두 개의 문장을 넣어둘 겁니다. its length is 25
2 sent: 문장을 이렇게 만들어 둘거에요. its length is 16
3 sent: yield test와. its length is 11
4 sent: len test를 해보세요. its length is 14
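One practical consequence of putting `yield` inside `__iter__` of a class, rather than using a bare generator, is that the corpus can be looped over more than once. The sketch below assumes a simplified, hypothetical `LineCorpus` class and a temporary sample file, neither of which comes from the original post.

```python
import os
import tempfile


class LineCorpus:
    """Hypothetical, simplified stand-in: yields one stripped line at a time."""

    def __init__(self, corpus_fname):
        self.corpus_fname = corpus_fname

    def __iter__(self):
        # Every call reopens the file, so each for loop starts from the top.
        with open(self.corpus_fname, encoding='utf-8') as f:
            for doc in f:
                yield doc.strip()


# Build a small throwaway file so the sketch runs on its own.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('a\nb\nc\n')
    sample_path = tmp.name

corpus = LineCorpus(sample_path)
first_pass = list(corpus)
second_pass = list(corpus)  # works again: __iter__ builds a fresh generator

gen = iter(corpus)          # a single generator object, by contrast, is one-shot
once = list(gen)
again = list(gen)           # already exhausted, so nothing is left

os.remove(sample_path)
```

This matters for workloads that make several passes over the same corpus, since the object can simply be handed to another for loop.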

# Make len(corpus) return the length

class DoublespaceLineCorpus:
    def __init__(self, corpus_fname, iter_sent=False):
        self.corpus_fname = corpus_fname
        self.iter_sent = iter_sent
        # Cached counts; 0 means "not counted yet".
        self.num_sents = 0
        self.num_docs = 0

    def __iter__(self):
        with open(self.corpus_fname, encoding='utf-8') as f:
            for doc_idx, doc in enumerate(f):
                if not self.iter_sent:
                    yield doc
                    continue
                for sent in doc.split('  '):
                    sent = sent.strip()
                    if not sent:
                        continue
                    yield sent

    def __len__(self):
        if self.iter_sent:
            # Count non-empty sentences once, then reuse the cached value.
            if self.num_sents == 0:
                with open(self.corpus_fname, encoding='utf-8') as f:
                    self.num_sents = sum(1 for doc in f for sent in doc.strip().split('  ') if sent.strip())
            return self.num_sents
        else:
            # Count non-empty lines once, then reuse the cached value.
            if self.num_docs == 0:
                with open(self.corpus_fname, encoding='utf-8') as f:
                    self.num_docs = sum(1 for doc in f if doc.strip())
            return self.num_docs

corpus = DoublespaceLineCorpus('./day0_3.txt', iter_sent=True)

print('when iter_sent == True, n sents = {}'.format(len(corpus)))

when iter_sent == True, n sents = 5

corpus.iter_sent = False

print('when iter_sent == False, n docs = {}'.format(len(corpus)))

when iter_sent == False, n docs = 4
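The reason `__len__` has to be written by hand is that a generator has no length of its own; the count has to come from a separate streaming pass over the file. A minimal sketch of that counting step follows, assuming sentences are separated by a double space as the class name suggests. `count_sents` and the sample file are hypothetical names introduced only for this illustration.

```python
import os
import tempfile


def count_sents(path, sep='  '):
    """Count non-empty, separator-delimited sentences in one streaming pass."""
    with open(path, encoding='utf-8') as f:
        return sum(1 for doc in f for sent in doc.split(sep) if sent.strip())


# Build a small throwaway file so the sketch runs on its own.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    # The second line holds two sentences; the third line is empty.
    tmp.write('one sentence\nfirst half  second half\n\nlast one\n')
    sample_path = tmp.name

n_sents = count_sents(sample_path)

# A bare generator does not support len(), which is why the class defines __len__.
gen = (x for x in range(3))
try:
    len(gen)
    generator_has_len = True
except TypeError:
    generator_has_len = False

os.remove(sample_path)
```

Because the count needs a full pass over the file, caching it on the instance (as `num_sents` and `num_docs` do above) avoids rereading the file on every `len()` call.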
