sider.ext.wsgi_referer_stat
— Collecting referers using sorted sets¶
This tutorial will show you a basic example using sorted sets. We will build a small WSGI middleware that simply collects all Referers of the given WSGI web application.
WSGI and middlewares¶
WSGI is a standard interface between web servers and Python web applications or frameworks to promote web application portability across a variety of web servers. (If you are from Java, think servlet. If you are from Ruby, think Rack.)
WSGI applications can be deployed into WSGI containers (server implementations). There are many production-ready WSGI containers; some are very fast, and others are very reliable. Check out Green Unicorn, uWSGI, mod_wsgi, and so forth.
WSGI middleware is somewhat like the decorator pattern for WSGI applications. It is usually implemented using nested higher-order functions or classes with a __call__() special method.
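As a rough illustration of the two styles, here is a minimal sketch. The names hello_app, add_header_middleware, AddHeaderMiddleware and the X-Example header are hypothetical and exist only for this example.

```python
def hello_app(environ, start_response):
    """A trivial WSGI application used only for this illustration."""
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello']


def add_header_middleware(application):
    """Function style: a higher-order function that wraps an application."""
    def wrapped(environ, start_response):
        def custom_start_response(status, headers, exc_info=None):
            # Append a hypothetical header before delegating.
            headers = list(headers) + [('X-Example', 'function')]
            return start_response(status, headers, exc_info)
        return application(environ, custom_start_response)
    return wrapped


class AddHeaderMiddleware(object):
    """Class style: instances are callable through __call__()."""

    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        def custom_start_response(status, headers, exc_info=None):
            headers = list(headers) + [('X-Example', 'class')]
            return start_response(status, headers, exc_info)
        return self.application(environ, custom_start_response)


# Both wrappers are themselves WSGI applications, so they can be nested.
app = AddHeaderMiddleware(add_header_middleware(hello_app))
```

Either style works; the class style is convenient when the middleware has to keep state, which is why this tutorial uses it.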
See also
To learn more details about WSGI, read PEP 333 and other related resources. This tutorial doesn’t cover WSGI itself.
- PEP 333 — Python Web Server Gateway Interface v1.0
- This document specifies a proposed standard interface between web servers and Python web applications or frameworks, to promote web application portability across a variety of web servers.
- Getting Started with WSGI by Armin Ronacher
- Armin Ronacher, the author of Flask, Werkzeug and Jinja, wrote this WSGI tutorial.
- A Do-It-Yourself Framework by Ian Bicking
- Ian Bicking, the author of Paste, WebOb, lxml.html and FormEncode, explains WSGI apps and middlewares.
Simple idea¶
The simple idea we’ll implement here is to collect every Referer and store it in persistent storage. We will use Redis as the persistent store, and we want to increment a count for each Referer.
Stored data will be like:
Referer | Count |
---|---|
http://dahlia.kr/ | 1 |
https://github.com/dahlia/sider | 3 |
https://twitter.com/hongminhee | 6 |
We could use a hash here, but a sorted set seems more suitable. Sorted sets are a data structure provided by Redis: basically a set, but each member carries a score, which lets us represent duplicates as counts (ZINCRBY).
We can also list a sorted set in ascending (ZRANGE) or descending (ZREVRANGE) order.
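To make these semantics concrete, here is a rough in-memory model of the three commands, written without Redis. The class ToySortedSet and its method names are made up for this sketch; it only mimics how scores and ordering behave and says nothing about how Redis or Sider implement them.

```python
class ToySortedSet(object):
    """In-memory stand-in that mimics ZINCRBY, ZRANGE and ZREVRANGE."""

    def __init__(self):
        self.scores = {}  # member -> score

    def zincrby(self, member, amount=1.0):
        # A missing member is treated as having a score of 0.
        self.scores[member] = self.scores.get(member, 0.0) + amount
        return self.scores[member]

    def zrange(self):
        # Ascending by score; ties are ordered by member.
        return sorted(self.scores.items(),
                      key=lambda pair: (pair[1], pair[0]))

    def zrevrange(self):
        # Exactly the reverse of the ascending order.
        return list(reversed(self.zrange()))


referers = ToySortedSet()
for url in ['http://dahlia.kr/',
            'https://github.com/dahlia/sider',
            'https://github.com/dahlia/sider',
            'https://github.com/dahlia/sider']:
    referers.zincrby(url)

# top[0] is the most frequently seen referer together with its count.
top = referers.zrevrange()
```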
See also
- Redis Data Types
- The Redis documentation that explains its data types: strings, lists, sets, sorted sets and hashes.
Prototyping with an in-memory dictionary¶
First of all, we can implement a proof-of-concept prototype without Redis.
Python has no sorted sets, so we will use dict instead.
class RefererStatMiddleware(object):
    '''A simple WSGI middleware that collects :mailheader:`Referer`
    headers.
    '''

    def __init__(self, application):
        assert callable(application)
        self.application = application
        self.referer_set = {}

    def __call__(self, environ, start_response):
        try:
            referer = environ['HTTP_REFERER']
        except KeyError:
            pass
        else:
            try:
                self.referer_set[referer] += 1
            except KeyError:
                self.referer_set[referer] = 1
        return self.application(environ, start_response)
It still has some problems. What are they?
- WSGI applications can be deployed to multiple server nodes, or forked into multiple processes as well. That means the RefererStatMiddleware.referer_set attribute can be split across them and not shared.
- Increments of duplication counts aren’t atomic.
- Data will be lost when the server process is terminated.
We can solve those problems by using Redis sorted sets instead of a Python in-memory dict.
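The first problem can be seen without deploying anything. In the sketch below two middleware instances stand in for two worker processes; the class is a condensed copy of the prototype, while dummy_app, ignore_start_response, worker_a and worker_b are hypothetical names used only here.

```python
class RefererStatMiddleware(object):
    """Condensed copy of the dict-based prototype."""

    def __init__(self, application):
        self.application = application
        self.referer_set = {}

    def __call__(self, environ, start_response):
        referer = environ.get('HTTP_REFERER')
        if referer is not None:
            self.referer_set[referer] = self.referer_set.get(referer, 0) + 1
        return self.application(environ, start_response)


def dummy_app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'ok']


def ignore_start_response(status, headers, exc_info=None):
    pass


# Two instances stand in for two worker processes serving the same site.
worker_a = RefererStatMiddleware(dummy_app)
worker_b = RefererStatMiddleware(dummy_app)

request = {'HTTP_REFERER': 'http://dahlia.kr/'}
worker_a(request, ignore_start_response)
worker_a(request, ignore_start_response)
worker_b(request, ignore_start_response)

# Each worker only knows about the requests it handled itself.
count_a = worker_a.referer_set['http://dahlia.kr/']
count_b = worker_b.referer_set['http://dahlia.kr/']
```

Three requests arrived in total, but neither worker holds the full count.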
Sider and persistent objects¶
It’s a simple job, so we could issue the Redis commands by hand. However, this is a tutorial example for Sider. :-) We will use Sider’s sorted set abstraction here instead. It’s more abstract and easier to use!
Before touching our middleware code, the following session in the Python interactive shell should help you understand the basics of how to use Sider:
>>> from redis.client import StrictRedis
>>> from sider.session import Session
>>> from sider.types import SortedSet
>>> session = Session(StrictRedis())
>>> my_sorted_set = session.get('my_sorted_set', SortedSet)
>>> my_sorted_set
<sider.sortedset.SortedSet ('my_sorted_set') {}>
Note
Did you face ImportError?
>>> from redis.client import StrictRedis
Traceback (most recent call last):
  File "<console>", line 1, in <module>
ImportError: No module named redis
You probably didn’t install Python redis client library. You can install it through pip:
$ pip install redis
Or easy_install:
$ easy_install redis
Okay, here’s an empty set: my_sorted_set. Let’s add something to it.
>>> my_sorted_set
<sider.sortedset.SortedSet ('my_sorted_set') {}>
>>> my_sorted_set.add('http://dahlia.kr/') # ZINCRBY
>>> my_sorted_set
<sider.sortedset.SortedSet ('my_sorted_set') {'http://dahlia.kr/'}>
Unlike Python’s in-memory set or dict, it’s a persistent object. In other words, my_sorted_set still contains 'http://dahlia.kr/' even if you quit this session of the Python interactive shell. Try it yourself: type exit() to quit the session and enter python again. And then...
>>> my_sorted_set
Traceback (most recent call last):
  File "<console>", line 1, in <module>
NameError: name 'my_sorted_set' is not defined
I didn’t lie! You need to load the Sider session first.
>>> from redis.client import StrictRedis
>>> from sider.session import Session
>>> from sider.types import SortedSet
>>> client = StrictRedis()
>>> session = Session(client)
>>> my_sorted_set = session.get('my_sorted_set', SortedSet)
Then:
>>> my_sorted_set
<sider.sortedset.SortedSet ('my_sorted_set') {'http://dahlia.kr/'}>
Yeah!
Note that the following line:
>>> client = StrictRedis()
will connect to the Redis server on localhost:6379 by default. There are host and port parameters to configure it.
>>> client = StrictRedis(host='localhost', port=6379)
Sorted sets¶
You can update() multiple values at a time:
>>> my_sorted_set.update(['https://github.com/dahlia/sider',
...                       'https://twitter.com/hongminhee'])  # ZINCRBY
>>> my_sorted_set
<sider.sortedset.SortedSet ('my_sorted_set')
 {'https://github.com/dahlia/sider', 'https://twitter.com/hongminhee',
  'http://dahlia.kr/'}>
>>> my_sorted_set.update(['http://dahlia.kr/',
...                       'https://twitter.com/hongminhee'])  # ZINCRBY
>>> my_sorted_set
<sider.sortedset.SortedSet ('my_sorted_set')
 {'https://github.com/dahlia/sider', 'https://twitter.com/hongminhee': 2.0,
  'http://dahlia.kr/': 2.0}>
>>> my_sorted_set['http://dahlia.kr/'] # ZSCORE
2.0
>>> my_sorted_set.add('http://dahlia.kr/')
>>> my_sorted_set['http://dahlia.kr/'] # ZSCORE
3.0
As you can see, each additional add() of the same member increments its score by 1. This property is what we will use in the middleware.
You can list the values the sorted set contains along with their scores. Similar to dict, there’s an items() method.
>>> my_sorted_set.items()  # ZRANGE
[('https://github.com/dahlia/sider', 1.0),
 ('https://twitter.com/hongminhee', 2.0),
 ('http://dahlia.kr/', 3.0)]
>>> my_sorted_set.items(reverse=True)  # ZREVRANGE
[('http://dahlia.kr/', 3.0),
 ('https://twitter.com/hongminhee', 2.0),
 ('https://github.com/dahlia/sider', 1.0)]
There are many other features of the SortedSet type, but this is all we need to know to implement the middleware, so we will stop introducing the type here and move on.
Replace dict with SortedSet¶
To replace dict with SortedSet, look at the RefererStatMiddleware.__init__() method first:
def __init__(self, application):
    self.application = application
    self.referer_set = {}
Note
The following code implicitly assumes these imports:
from redis.client import StrictRedis
from sider.session import Session
from sider.types import SortedSet
The above code can be easily changed to:
def __init__(self, application):
    assert callable(application)
    self.application = application
    client = StrictRedis()
    session = Session(client)
    self.referer_set = session.get('wsgi_referer_set', SortedSet)
It should be more configurable by users. The Redis key is currently hard-coded as wsgi_referer_set. It can be parameterized, right?
def __init__(self, set_key, application):
    assert callable(application)
    self.application = application
    client = StrictRedis()
    session = Session(client)
    self.referer_set = session.get(str(set_key), SortedSet)
It still lacks configurability. Users can’t set the address of the Redis server to connect to. Parameterize session as well:
def __init__(self, session, set_key, application):
    assert isinstance(session, Session)
    assert callable(application)
    self.application = application
    self.referer_set = session.get(str(set_key), SortedSet)
Okay, it’s flexible enough for different environments. Our first and third problems have just been solved: the data is now shared rather than split across processes, and it is no longer lost when a process terminates.
Next, we have to make the increment atomic. See this part of the RefererStatMiddleware.__call__() method:
try:
    self.referer_set[referer] += 1
except KeyError:
    self.referer_set[referer] = 1
A Redis sorted set offers a simple atomic way to increase a member’s score: ZINCRBY. Sider maps the ZINCRBY command to the SortedSet.add() method. So those lines can be replaced by the following line:
self.referer_set.add(referer)
and the increment will be applied atomically.
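To see why atomicity matters, consider what a read-then-write increment looks like when two workers interleave. The sketch below simulates this deterministically with a plain dict standing in for the shared store; store, read, write and atomic_increment are hypothetical names, and no real Redis is involved.

```python
store = {'http://dahlia.kr/': 1}  # stand-in for a shared data store


def read(key):
    return store.get(key, 0)


def write(key, value):
    store[key] = value


key = 'http://dahlia.kr/'

# Non-atomic: both workers read before either of them writes.
seen_by_a = read(key)
seen_by_b = read(key)
write(key, seen_by_a + 1)
write(key, seen_by_b + 1)
lost_update_result = store[key]  # one of the two increments is lost


def atomic_increment(key, amount=1):
    # Models a server-side increment such as ZINCRBY: the read and the
    # write happen as one step, so no other client can get in between.
    store[key] = store.get(key, 0) + amount
    return store[key]


store[key] = 1  # reset and repeat with the atomic operation
atomic_increment(key)
atomic_increment(key)
atomic_result = store[key]
```

Starting from a count of 1, the interleaved version ends at 2 while the atomic version ends at 3.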
Referer list page¶
Lastly, let’s add an additional page for listing collected referers. This page simply shows a list of referers and their counts, ordered by count in descending order.
To deal with HTML, this example will use the Jinja template engine. Its syntax is similar to the Django template language, but more expressive. You can install it through pip or easy_install:
$ pip install Jinja2 # or:
$ easy_install Jinja2
Here is the HTML template code using Jinja:
<h1>Referer List</h1>
<table>
  <thead>
    <tr>
      <th>URL</th>
      <th>Count</th>
    </tr>
  </thead>
  <tbody>
    {% for url, count in referers %}
      <tr>
        <th><a href="{{ url|escape }}" rel="noreferrer">
            {{- url|escape }}</a></th>
        <td>{{ count|int }}</td>
      </tr>
    {% endfor %}
  </tbody>
</table>
Save this template source to a file named templates/stat.html.
Note that the template uses a variable it does not define: referers. We have to pass this variable from the WSGI middleware code.
To load this template file, a Jinja environment object has to be set up in the web application code. Append the following lines to the RefererStatMiddleware.__init__() method (they assume Environment and PackageLoader have been imported from jinja2):
loader = PackageLoader(__name__)
environment = Environment(loader=loader)
We can now load the template using the Environment.get_template() method. Append the following line to the RefererStatMiddleware.__init__() method:
self.template = environment.get_template('stat.html')
When RefererStatMiddleware is initialized, its template will be loaded together.
Next, let’s add a new stat_application() method, which will serve the list page, to the middleware class. This method has to be a WSGI application as well:
def stat_application(self, environ, start_response):
    content_type = 'text/html; charset=utf-8'
    start_response('200 OK', [('Content-Type', content_type)])
    referers = self.referer_set.items(reverse=True)
    return self.template.render(referers=referers).encode('utf-8'),
The Template.render() method takes variables to pass as keywords and returns the rendered result as a unicode string. The trailing comma on the return line wraps the encoded bytes in a one-element tuple, since a WSGI application has to return an iterable of byte strings.
We have passed the referers variable in this line. Its value is produced by the SortedSet.items() method with the reverse=True option, which means descending order.
To connect this modular WSGI application to the main application, we should add the following conditional routine at the beginning of the RefererStatMiddleware.__call__() method:
path = environ['PATH_INFO']
if path == '/__stat__' or path.startswith('/__stat__/'):
    return self.stat_application(environ, start_response)
It delegates responsibility for responding to the stat_application() application if the request is for the path /__stat__ or one of its subpaths.
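The whole control flow can be exercised without Redis or Jinja. The following is only a sketch: FakeSortedSet is a hypothetical stand-in that imitates the add() and items(reverse=True) calls used in this tutorial, a plain string join replaces the template, and the constructor takes the set directly instead of a session and key.

```python
class FakeSortedSet(object):
    """Hypothetical stand-in for the sorted set used in this tutorial."""

    def __init__(self):
        self.scores = {}

    def add(self, member):
        self.scores[member] = self.scores.get(member, 0.0) + 1.0

    def items(self, reverse=False):
        pairs = sorted(self.scores.items(), key=lambda p: (p[1], p[0]))
        return pairs[::-1] if reverse else pairs


class RefererStatMiddleware(object):

    def __init__(self, referer_set, application):
        self.referer_set = referer_set
        self.application = application

    def stat_application(self, environ, start_response):
        start_response('200 OK',
                       [('Content-Type', 'text/plain; charset=utf-8')])
        lines = ['%s %d' % (url, count)
                 for url, count in self.referer_set.items(reverse=True)]
        # One-element tuple: an iterable of byte strings.
        return '\n'.join(lines).encode('utf-8'),

    def __call__(self, environ, start_response):
        path = environ['PATH_INFO']
        if path == '/__stat__' or path.startswith('/__stat__/'):
            return self.stat_application(environ, start_response)
        try:
            referer = environ['HTTP_REFERER']
        except KeyError:
            pass
        else:
            self.referer_set.add(referer)
        return self.application(environ, start_response)


def hello_app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello']


def ignore_start_response(status, headers, exc_info=None):
    pass


app = RefererStatMiddleware(FakeSortedSet(), hello_app)
for referer in ['https://twitter.com/hongminhee',
                'https://twitter.com/hongminhee',
                'http://dahlia.kr/']:
    app({'PATH_INFO': '/', 'HTTP_REFERER': referer}, ignore_start_response)

# Requests under /__stat__ are answered by the middleware itself.
stat_body = b''.join(app({'PATH_INFO': '/__stat__'}, ignore_start_response))
```

Ordinary requests still reach hello_app; only the statistics path is intercepted.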
Now go to the /__stat__ page and your browser will show a table like this:
Referer List
URL | Count |
---|---|
https://twitter.com/hongminhee | 6 |
https://github.com/dahlia/sider | 3 |
http://dahlia.kr/ | 1 |
Source code¶
The complete source code of this example can be found in the examples/wsgi-referer-stat/ directory of the repository.
https://github.com/dahlia/sider/tree/master/examples/wsgi-referer-stat
It’s public domain, feel free!
Final API¶
- class sider_wsgi_referer_stat.RefererStatMiddleware(session, set_key, application, stat_path='/__stat__')¶
A simple WSGI middleware that collects Referer headers and stores them into a Redis sorted set.
You can see the list of referrers ordered by duplication count in the /__stat__ page (or you can configure the stat_path argument).
Parameters:
- session (sider.session.Session) – sider session object
- set_key (basestring) – the key name of Redis sorted set to store data
- application (collections.Callable) – wsgi app to wrap
- stat_path (basestring) – path to see the collected data. default is '/__stat__'. if it’s None the data cannot be accessed from outside
- referer_set = None¶
(sider.sortedset.SortedSet) The set of collected Referer strings.
- stat_application(environ, start_response)¶
WSGI application that lists its collected referers.
- session (