Generator functions
We now know that once the interpreter executes a return statement in a function, control is handed back permanently to the place where the function was called from. We may still be able to access some values of this expired function using closures, but sometimes it is necessary to pass control back into the function even after the caller has received something from it. This is where a different keyword comes into play: yield.
In order to understand this, let us write a function.
def yield_demo():
    yield 1
    yield "two"
    yield 3.0
There is no return statement here. In fact, Python 2 doesn't allow a return statement with a value inside a generator and throws a SyntaxError. Python 3 is more forgiving and treats return as syntactic sugar for raising a StopIteration exception, with the returned value attached to the exception.
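As a quick sketch of this (the function return_demo is made up for illustration and assumes Python 3.3 or later), mixing yield and return looks like this, with the returned value riding on the StopIteration exception:
def return_demo():
    yield 1
    return "done"    # in Python 3 this ends the generator; the value rides on StopIteration
>>> g = return_demo()
>>> next(g)
1
>>> next(g)
<Traceback message>
StopIteration: done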
>>> g=yield_demo()
>>> print(g)
<generator object yield_demo at 0x7f5af3a51db0>
>>> next(g)
1
>>> next(g)
'two'
>>> next(g)
3.0
>>> next(g)
<Traceback message>
StopIteration
The first thing we notice is that yield_demo doesn't return a value when called. Instead, it returns a generator object whose successive values must be extracted using the next built-in function.
Where does the persistence of the function come into play? Take a look at the line next(g), whose output is 1. At this point, Python has executed the line yield 1 in the execution instance of yield_demo that has been bound to the generator object g. Although control of execution is sent back to Python's REPL shell, the state of that particular execution instance of yield_demo is still in memory. When next(g) is called again, execution resumes just after the last yield statement that was encountered. This goes on until no more yield is available and Python automatically raises a StopIteration exception.
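To make this persistence of local state more tangible, here is a small illustrative sketch; running_total is a made-up function, not part of the example above. Its local variable total survives between successive next calls.
def running_total():
    total = 0
    for n in [10, 20, 30]:
        total += n        # local state is preserved across next() calls
        yield total
>>> g = running_total()
>>> next(g)
10
>>> next(g)
30
>>> next(g)
60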
The simplest way to consume a generator is not to call the next function repeatedly but instead to use a for loop.
>>> g=yield_demo()
>>> for item in g:
...     print(item)
...
1
two
3.0
The for statement uses the StopIteration exception to terminate the loop. If a for loop doesn't receive any item at all, it may mean that the generator has already been exhausted. This can occur if earlier next calls or for loops have already extracted all the items from the generator.
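Here is a small sketch of that situation, reusing yield_demo from earlier; after a next call and a for loop have consumed everything, a second for loop has nothing left to print.
>>> g = yield_demo()
>>> next(g)
1
>>> for item in g:
...     print(item)
...
two
3.0
>>> for item in g:
...     print(item)
...
>>>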
Pipelines using generators
It is possible to chain generators. A generator function can consume the output of another generator function, which in turn can consume the output of yet another, and so on. In this fashion, we can build pipelines.
Here is a piece of code that generates a Bruce Lee quote and a Neil Gaiman quote line by line. It also has a generator function count_word that takes in a generator function, receives each yielded line, counts the number of words in it, and yields that count.
def source_lee():
    quote = ["Empty your mind, be formless, shapeless, like water.",
             "If you put water into a cup, it becomes the cup.",
             "You put water into a bottle and it becomes the bottle.",
             "You put it in a teapot it becomes the teapot.",
             "Now, water can flow or it can crash.",
             "Be water my friend."]
    for line in quote:
        yield line
def source_gaiman():
    quote = ["Husband runs off with a politician? Make good art.",
             "Leg crushed and then eaten by mutated boa constrictor? Make good art.",
             "IRS on your trail? Make good art.",
             "Cat exploded? Make good art.",
             "Somebody on the Internet thinks what you do is stupid or evil or it's all been done before? Make good art.",
             "Probably things will work out somehow, and eventually time will take the sting away, but that doesn't matter.",
             "Do what only you do best. Make good art."]
    for line in quote:
        yield line
def count_word(source):
    gen_source = source()
    for line in gen_source:
        yield len(line.split(" "))
We can now chain generators to create a master generator.
>>> wc_lee = count_word(source_lee)
>>> next(wc_lee)
8
>>> next(wc_lee)
11
>>> next(wc_lee)
11
>>> wc_gaiman = count_word(source_gaiman)
>>> next(wc_gaiman)
9
>>> next(wc_gaiman)
12
>>> next(wc_lee)
10
Each next call propagates down the chain of nested generators and pulls the data up, while each generator (in this case, count_word) processes the data (in this case, counts the words) and passes it on.
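To see how another stage could be bolted onto such a pipeline, here is a minimal sketch; the filtering generator keep_long is hypothetical and not part of the example above. It consumes the word counts produced by count_word and passes along only those above a threshold.
def keep_long(counts, minimum):
    # downstream stage: forward only the counts of sufficiently long lines
    for n in counts:
        if n >= minimum:
            yield n
>>> pipeline = keep_long(count_word(source_gaiman), 10)
>>> next(pipeline)
12
>>> next(pipeline)
21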
This is very similar to building Unix pipelines.
Let us say that we have a very large Python file called vectors.py
. It contains a lot of functions. In Unix (or on Unix-like systems) it is very easy to count the number of such functions. Here's one way to do it:
$ cat vectors.py | grep "^\s\+def " | wc -l
64
The contents of the file, line by line, are the output of cat vectors.py, which is fed as input to grep "^\s\+def " to filter the lines that match our regex, which in turn is fed into wc -l to count the number of lines. Piping the output of one command as the input to another is a fundamental concept of Unix. These pipelines are very similar to the ones built with generators, where the output of one generator is passed on to the next. Let us try to emulate this behaviour.
import re
def cat(filename):
    with open(filename, 'r') as fp:
        while True:
            line = fp.readline()
            if not line:
                # readline returns an empty string at the end of the file
                break
            else:
                yield line
def grep(regexp, str_gen_in):
    for line in str_gen_in:
        match = re.findall(regexp, line)
        if match:
            yield match
def wc_l(str_gen_in):
    return len(list(str_gen_in))
The above is a rather convoluted and incomplete way of creating proxies for the cat and grep commands in Python; it is only for illustrative purposes.
We can chain these functions to create a pipe.
>>> wc_l(grep(r"^\s+def ", cat("vectors.py")))
64
The function wc_l
is not a generator and has a return
instead of yield
. This is to consolidate the data from the generators. In fact, we are forcing wc_l
to evaluate all the yields of the input generator by converting it into a list. The other two functions are effectively piped before wc_l
just like the Unix command we saw earlier.
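If materializing the whole list feels wasteful for a very large file, a minimal alternative sketch (not from the original code) consolidates the counts while still pulling items one at a time:
def wc_l(str_gen_in):
    # count the yielded items without building an intermediate list
    return sum(1 for _ in str_gen_in)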
For the Python cat function, the generator stops when readline returns an empty string, which signals the end of the file. For the Python grep function, the generator pulls every line from str_gen_in but yields only when match is not an empty list, and it stops once str_gen_in is exhausted.
One can build useful applications using this concept. Parsing large text files (such as logs) and processing incoming data streams are some of the more interesting uses.
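As a final sketch of that idea, here is a hypothetical pipeline for lazily extracting error lines from a large log file; the file name access.log and the "ERROR" marker are assumptions made purely for illustration.
def read_lines(filename):
    # lazily stream a potentially huge file, one line at a time
    with open(filename, 'r') as fp:
        for line in fp:
            yield line
def only_errors(lines):
    # forward only the lines that mention an error
    for line in lines:
        if "ERROR" in line:
            yield line
Nothing is read from the disk until we start pulling items, for example with next(only_errors(read_lines("access.log"))) or a for loop, so even a very large log can be processed with a small memory footprint.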