Import Upstream version 2.7.18
766
Doc/howto/argparse.rst
Normal file
@@ -0,0 +1,766 @@
*****************
Argparse Tutorial
*****************

:author: Tshepang Lekhonkhobe

.. _argparse-tutorial:

This tutorial is intended to be a gentle introduction to :mod:`argparse`, the
recommended command-line parsing module in the Python standard library.
This was written for argparse in Python 3. A few details are different in 2.x,
especially some exception messages, which were improved in 3.x.

.. note::

   There are two other modules that fulfill the same task, namely
   :mod:`getopt` (an equivalent for :c:func:`getopt` from the C
   language) and the deprecated :mod:`optparse`.
   Note also that :mod:`argparse` is based on :mod:`optparse`,
   and therefore very similar in terms of usage.


Concepts
========

Let's show the sort of functionality that we are going to explore in this
introductory tutorial by making use of the :command:`ls` command:

.. code-block:: sh

   $ ls
   cpython  devguide  prog.py  pypy  rm-unused-function.patch
   $ ls pypy
   ctypes_configure  demo  dotviewer  include  lib_pypy  lib-python ...
   $ ls -l
   total 20
   drwxr-xr-x 19 wena wena 4096 Feb 18 18:51 cpython
   drwxr-xr-x  4 wena wena 4096 Feb  8 12:04 devguide
   -rwxr-xr-x  1 wena wena  535 Feb 19 00:05 prog.py
   drwxr-xr-x 14 wena wena 4096 Feb  7 00:59 pypy
   -rw-r--r--  1 wena wena  741 Feb 18 01:01 rm-unused-function.patch
   $ ls --help
   Usage: ls [OPTION]... [FILE]...
   List information about the FILEs (the current directory by default).
   Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.
   ...

A few concepts we can learn from the four commands:

* The :command:`ls` command is useful when run without any options at all. It
  defaults to displaying the contents of the current directory.

* If we want more than what it provides by default, we tell it a bit more. In
  this case, we want it to display a different directory, ``pypy``.
  What we did is specify what is known as a positional argument. It's named so
  because the program should know what to do with the value, solely based on
  where it appears on the command line. This concept is more relevant
  to a command like :command:`cp`, whose most basic usage is ``cp SRC DEST``.
  The first position is *what you want copied,* and the second
  position is *where you want it copied to*.

* Now, say we want to change the behaviour of the program. In our example,
  we display more info for each file instead of just showing the file names.
  The ``-l`` in that case is known as an optional argument.

* That's a snippet of the help text. It's very useful in that you can
  come across a program you have never used before, and can figure out
  how it works simply by reading its help text.


The basics
==========

Let us start with a very simple example which does (almost) nothing::

   import argparse
   parser = argparse.ArgumentParser()
   parser.parse_args()

The following is the result of running the code:

.. code-block:: sh

   $ python prog.py
   $ python prog.py --help
   usage: prog.py [-h]

   optional arguments:
     -h, --help  show this help message and exit
   $ python prog.py --verbose
   usage: prog.py [-h]
   prog.py: error: unrecognized arguments: --verbose
   $ python prog.py foo
   usage: prog.py [-h]
   prog.py: error: unrecognized arguments: foo

Here is what is happening:

* Running the script without any options results in nothing displayed to
  stdout. Not so useful.

* The second one starts to display the usefulness of the :mod:`argparse`
  module. We have done almost nothing, but already we get a nice help message.

* The ``--help`` option, which can also be shortened to ``-h``, is the only
  option we get for free (i.e. no need to specify it). Specifying anything
  else results in an error. But even then, we do get a useful usage message,
  also for free.


Introducing Positional arguments
================================

An example::

   import argparse
   parser = argparse.ArgumentParser()
   parser.add_argument("echo")
   args = parser.parse_args()
   print args.echo

And running the code:

.. code-block:: sh

   $ python prog.py
   usage: prog.py [-h] echo
   prog.py: error: the following arguments are required: echo
   $ python prog.py --help
   usage: prog.py [-h] echo

   positional arguments:
     echo

   optional arguments:
     -h, --help  show this help message and exit
   $ python prog.py foo
   foo

Here is what's happening:

* We've added a call to the :meth:`add_argument` method, which is what we use
  to specify which command-line options the program is willing to accept. In
  this case, I've named the argument ``echo`` so that it's in line with its
  function.

* Calling our program now requires us to specify an argument.

* The :meth:`parse_args` method actually returns some data from the
  options specified, in this case, ``echo``.

* The variable is some form of 'magic' that :mod:`argparse` performs for free
  (i.e. no need to specify which variable that value is stored in).
  You will also notice that its name matches the string argument given
  to the method, ``echo``.
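
The attribute naming described above can be checked without a shell by handing
:meth:`parse_args` an explicit list of strings in place of the real command
line. The following is only a sketch: the helper name ``build_parser`` and the
``prog="prog.py"`` setting are assumptions made for this illustration, and it
avoids ``print`` so that it runs unchanged on 2.7 and 3.x.

```python
import argparse


def build_parser():
    # Hypothetical helper: builds the same one-argument parser as above.
    # prog= is set only so usage messages read "prog.py" (an assumption).
    parser = argparse.ArgumentParser(prog="prog.py")
    parser.add_argument("echo")
    return parser


# The list stands in for sys.argv[1:].
args = build_parser().parse_args(["foo"])
echoed = args.echo      # attribute name matches the string "echo"
as_dict = vars(args)    # the returned Namespace viewed as a plain dict
```

Here ``echoed`` holds the string given on the (simulated) command line, and
``vars(args)`` shows that the value is stored under the key ``"echo"``.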

Note however that, although the help display looks nice and all, it currently
is not as helpful as it can be. For example we see that we got ``echo`` as a
positional argument, but we don't know what it does, other than by guessing or
by reading the source code. So, let's make it a bit more useful::

   import argparse
   parser = argparse.ArgumentParser()
   parser.add_argument("echo", help="echo the string you use here")
   args = parser.parse_args()
   print args.echo

And we get:

.. code-block:: sh

   $ python prog.py -h
   usage: prog.py [-h] echo

   positional arguments:
     echo        echo the string you use here

   optional arguments:
     -h, --help  show this help message and exit

Now, how about doing something even more useful::

   import argparse
   parser = argparse.ArgumentParser()
   parser.add_argument("square", help="display a square of a given number")
   args = parser.parse_args()
   print args.square**2

The following is the result of running the code:

.. code-block:: sh

   $ python prog.py 4
   Traceback (most recent call last):
     File "prog.py", line 5, in <module>
       print args.square**2
   TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'int'

That didn't go so well. That's because :mod:`argparse` treats the options we
give it as strings, unless we tell it otherwise. So, let's tell
:mod:`argparse` to treat that input as an integer::

   import argparse
   parser = argparse.ArgumentParser()
   parser.add_argument("square", help="display a square of a given number",
                       type=int)
   args = parser.parse_args()
   print args.square**2

The following is the result of running the code:

.. code-block:: sh

   $ python prog.py 4
   16
   $ python prog.py four
   usage: prog.py [-h] square
   prog.py: error: argument square: invalid int value: 'four'

That went well. The program now even helpfully quits on illegal input
before proceeding.
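
To make the "quits on illegal input" behaviour visible from inside Python, the
sketch below catches the :exc:`SystemExit` that :mod:`argparse` raises after
printing its error message. The helper names ``parse_square`` and
``try_parse`` are made up for this illustration, as is ``prog="prog.py"``.

```python
import argparse


def parse_square(argv):
    # Hypothetical helper wrapping the parser from the example above.
    parser = argparse.ArgumentParser(prog="prog.py")
    parser.add_argument("square", type=int,
                        help="display a square of a given number")
    return parser.parse_args(argv).square


def try_parse(argv):
    # On bad input argparse prints a usage message to stderr and raises
    # SystemExit; catching it lets us look at the exit code.
    try:
        return parse_square(argv)
    except SystemExit as exc:
        return "exit code {}".format(exc.code)


good = try_parse(["4"])     # the string "4" is converted by int()
bad = try_parse(["four"])   # int("four") fails, so argparse exits
```

A real script would normally let the exit happen; catching it here is only a
way to observe that conversion failures stop the program before the squaring
code is ever reached.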


Introducing Optional arguments
==============================

So far we have been playing with positional arguments. Let us
have a look at how to add optional ones::

   import argparse
   parser = argparse.ArgumentParser()
   parser.add_argument("--verbosity", help="increase output verbosity")
   args = parser.parse_args()
   if args.verbosity:
       print "verbosity turned on"

And the output:

.. code-block:: sh

   $ python prog.py --verbosity 1
   verbosity turned on
   $ python prog.py
   $ python prog.py --help
   usage: prog.py [-h] [--verbosity VERBOSITY]

   optional arguments:
     -h, --help            show this help message and exit
     --verbosity VERBOSITY
                           increase output verbosity
   $ python prog.py --verbosity
   usage: prog.py [-h] [--verbosity VERBOSITY]
   prog.py: error: argument --verbosity: expected one argument

Here is what is happening:

* The program is written so as to display something when ``--verbosity`` is
  specified and display nothing when not.

* To show that the option is actually optional, there is no error when running
  the program without it. Note that by default, if an optional argument isn't
  used, the relevant variable, in this case :attr:`args.verbosity`, is
  given ``None`` as a value, which is the reason it fails the truth
  test of the :keyword:`if` statement.

* The help message is a bit different.

* When using the ``--verbosity`` option, one must also specify some value,
  any value.
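
The ``None`` default mentioned above can be confirmed directly. This is a
small sketch under the same assumptions as before (an explicit argument list
instead of the real command line, and ``prog="prog.py"`` chosen for
illustration):

```python
import argparse

parser = argparse.ArgumentParser(prog="prog.py")
parser.add_argument("--verbosity", help="increase output verbosity")

omitted = parser.parse_args([])                   # option not given
given = parser.parse_args(["--verbosity", "1"])   # option given with a value
zero = parser.parse_args(["--verbosity", "0"])    # still a non-empty string

# omitted.verbosity is None (falsy). Without type=, the other two values are
# the strings "1" and "0", and any non-empty string is truthy.
turned_on = [bool(ns.verbosity) for ns in (omitted, given, zero)]
```

Note that even ``--verbosity 0`` passes the truth test, because the value is
stored as the string ``"0"`` rather than the integer ``0``.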

The above example accepts arbitrary values for ``--verbosity``, but for
our simple program, only two values are actually useful, ``True`` or ``False``.
Let's modify the code accordingly::

   import argparse
   parser = argparse.ArgumentParser()
   parser.add_argument("--verbose", help="increase output verbosity",
                       action="store_true")
   args = parser.parse_args()
   if args.verbose:
       print "verbosity turned on"

And the output:

.. code-block:: sh

   $ python prog.py --verbose
   verbosity turned on
   $ python prog.py --verbose 1
   usage: prog.py [-h] [--verbose]
   prog.py: error: unrecognized arguments: 1
   $ python prog.py --help
   usage: prog.py [-h] [--verbose]

   optional arguments:
     -h, --help  show this help message and exit
     --verbose   increase output verbosity

Here is what is happening:

* The option is now more of a flag than something that requires a value.
  We even changed the name of the option to match that idea.
  Note that we now specify a new keyword, ``action``, and give it the value
  ``"store_true"``. This means that, if the option is specified,
  assign the value ``True`` to :data:`args.verbose`.
  Not specifying it implies ``False``.

* It complains when you specify a value, in true spirit of what flags
  actually are.

* Notice the different help text.
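
As a quick check of the ``True``/``False`` behaviour described above, here is
a sketch that parses two explicit argument lists (the lists and
``prog="prog.py"`` are assumptions for illustration only):

```python
import argparse

parser = argparse.ArgumentParser(prog="prog.py")
parser.add_argument("--verbose", action="store_true",
                    help="increase output verbosity")

# With store_true the attribute is a real boolean in both cases.
flag_on = parser.parse_args(["--verbose"]).verbose
flag_off = parser.parse_args([]).verbose
```

Unlike the plain ``--verbosity`` option from before, the attribute is never
``None`` here: leaving the flag out yields ``False``.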


Short options
-------------

If you are familiar with command line usage,
you will notice that I haven't yet touched on the topic of short
versions of the options. It's quite simple::

   import argparse
   parser = argparse.ArgumentParser()
   parser.add_argument("-v", "--verbose", help="increase output verbosity",
                       action="store_true")
   args = parser.parse_args()
   if args.verbose:
       print "verbosity turned on"

And here goes:

.. code-block:: sh

   $ python prog.py -v
   verbosity turned on
   $ python prog.py --help
   usage: prog.py [-h] [-v]

   optional arguments:
     -h, --help     show this help message and exit
     -v, --verbose  increase output verbosity

Note that the new ability is also reflected in the help text.


Combining Positional and Optional arguments
===========================================

Our program keeps growing in complexity::

   import argparse
   parser = argparse.ArgumentParser()
   parser.add_argument("square", type=int,
                       help="display a square of a given number")
   parser.add_argument("-v", "--verbose", action="store_true",
                       help="increase output verbosity")
   args = parser.parse_args()
   answer = args.square**2
   if args.verbose:
       print "the square of {} equals {}".format(args.square, answer)
   else:
       print answer

And now the output:

.. code-block:: sh

   $ python prog.py
   usage: prog.py [-h] [-v] square
   prog.py: error: the following arguments are required: square
   $ python prog.py 4
   16
   $ python prog.py 4 --verbose
   the square of 4 equals 16
   $ python prog.py --verbose 4
   the square of 4 equals 16

* We've brought back a positional argument, hence the complaint.

* Note that the order does not matter.

How about we give this program of ours back the ability to have
multiple verbosity values, and actually get to use them::

   import argparse
   parser = argparse.ArgumentParser()
   parser.add_argument("square", type=int,
                       help="display a square of a given number")
   parser.add_argument("-v", "--verbosity", type=int,
                       help="increase output verbosity")
   args = parser.parse_args()
   answer = args.square**2
   if args.verbosity == 2:
       print "the square of {} equals {}".format(args.square, answer)
   elif args.verbosity == 1:
       print "{}^2 == {}".format(args.square, answer)
   else:
       print answer

And the output:

.. code-block:: sh

   $ python prog.py 4
   16
   $ python prog.py 4 -v
   usage: prog.py [-h] [-v VERBOSITY] square
   prog.py: error: argument -v/--verbosity: expected one argument
   $ python prog.py 4 -v 1
   4^2 == 16
   $ python prog.py 4 -v 2
   the square of 4 equals 16
   $ python prog.py 4 -v 3
   16

These all look good except the last one, which exposes a bug in our program.
Let's fix it by restricting the values the ``--verbosity`` option can accept::

   import argparse
   parser = argparse.ArgumentParser()
   parser.add_argument("square", type=int,
                       help="display a square of a given number")
   parser.add_argument("-v", "--verbosity", type=int, choices=[0, 1, 2],
                       help="increase output verbosity")
   args = parser.parse_args()
   answer = args.square**2
   if args.verbosity == 2:
       print "the square of {} equals {}".format(args.square, answer)
   elif args.verbosity == 1:
       print "{}^2 == {}".format(args.square, answer)
   else:
       print answer

And the output:

.. code-block:: sh

   $ python prog.py 4 -v 3
   usage: prog.py [-h] [-v {0,1,2}] square
   prog.py: error: argument -v/--verbosity: invalid choice: 3 (choose from 0, 1, 2)
   $ python prog.py 4 -h
   usage: prog.py [-h] [-v {0,1,2}] square

   positional arguments:
     square                display a square of a given number

   optional arguments:
     -h, --help            show this help message and exit
     -v {0,1,2}, --verbosity {0,1,2}
                           increase output verbosity

Note that the change is reflected both in the error message and in the
help string.
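
One detail worth seeing in action is that the ``type`` conversion runs before
the ``choices`` check, so the command-line string ``"2"`` is compared as the
integer ``2``. The sketch below assumes a helper named ``parse_verbosity``
(invented for this illustration) that turns the error exit into a return
value:

```python
import argparse


def parse_verbosity(argv):
    # Hypothetical helper; only the --verbosity option is reproduced here.
    parser = argparse.ArgumentParser(prog="prog.py")
    parser.add_argument("-v", "--verbosity", type=int, choices=[0, 1, 2],
                        help="increase output verbosity")
    try:
        return parser.parse_args(argv).verbosity
    except SystemExit:
        # Raised after argparse prints its "invalid choice" error to stderr.
        return "rejected"


accepted = parse_verbosity(["-v", "2"])   # converted to int, then checked
rejected = parse_verbosity(["-v", "3"])   # 3 is not in [0, 1, 2]
omitted = parse_verbosity([])             # choices is not applied to None
```

Leaving the option out is still allowed: ``choices`` only restricts values
that are actually supplied.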

Now, let's use a different approach of playing with verbosity, which is pretty
common. It also matches the way the CPython executable handles its own
verbosity argument (check the output of ``python --help``)::

   import argparse
   parser = argparse.ArgumentParser()
   parser.add_argument("square", type=int,
                       help="display a square of a given number")
   parser.add_argument("-v", "--verbosity", action="count",
                       help="increase output verbosity")
   args = parser.parse_args()
   answer = args.square**2
   if args.verbosity == 2:
       print "the square of {} equals {}".format(args.square, answer)
   elif args.verbosity == 1:
       print "{}^2 == {}".format(args.square, answer)
   else:
       print answer

We have introduced another action, "count",
to count the number of occurrences of a specific optional argument:

.. code-block:: sh

   $ python prog.py 4
   16
   $ python prog.py 4 -v
   4^2 == 16
   $ python prog.py 4 -vv
   the square of 4 equals 16
   $ python prog.py 4 --verbosity --verbosity
   the square of 4 equals 16
   $ python prog.py 4 -v 1
   usage: prog.py [-h] [-v] square
   prog.py: error: unrecognized arguments: 1
   $ python prog.py 4 -h
   usage: prog.py [-h] [-v] square

   positional arguments:
     square           display a square of a given number

   optional arguments:
     -h, --help       show this help message and exit
     -v, --verbosity  increase output verbosity
   $ python prog.py 4 -vvv
   16

* Yes, it's now more of a flag (similar to ``action="store_true"`` in the
  previous version of our script). That should explain the complaint.

* It also behaves similarly to the "store_true" action.

* Now here's a demonstration of what the "count" action gives. You've probably
  seen this sort of usage before.

* And if you don't specify the ``-v`` flag, that flag is considered to have
  ``None`` value (unlike the "store_true" action, which gives ``False``).

* As should be expected, specifying the long form of the flag, we should get
  the same output.

* Sadly, our help output isn't very informative on the new ability our script
  has acquired, but that can always be fixed by improving the documentation for
  our script (e.g. via the ``help`` keyword argument).

* That last output exposes a bug in our program.
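
The raw values produced by the "count" action can be listed side by side. This
sketch leaves out the ``square`` argument and parses explicit lists; both
simplifications, and ``prog="prog.py"``, are assumptions made to keep the
example short:

```python
import argparse

parser = argparse.ArgumentParser(prog="prog.py")
parser.add_argument("-v", "--verbosity", action="count",
                    help="increase output verbosity")

# One argument list per command line shown above (minus the square).
command_lines = ([], ["-v"], ["-vv"], ["--verbosity", "--verbosity"],
                 ["-vvv"])
counts = [parser.parse_args(argv).verbosity for argv in command_lines]
```

The first entry is ``None`` and the last is ``3``, which is exactly why the
``== 2`` and ``== 1`` comparisons in the script fall through to the plain
output for ``-vvv``.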


Let's fix it::

   import argparse
   parser = argparse.ArgumentParser()
   parser.add_argument("square", type=int,
                       help="display a square of a given number")
   parser.add_argument("-v", "--verbosity", action="count",
                       help="increase output verbosity")
   args = parser.parse_args()
   answer = args.square**2

   # bugfix: replace == with >=
   if args.verbosity >= 2:
       print "the square of {} equals {}".format(args.square, answer)
   elif args.verbosity >= 1:
       print "{}^2 == {}".format(args.square, answer)
   else:
       print answer

And this is what it gives:

.. code-block:: sh

   $ python prog.py 4 -vvv
   the square of 4 equals 16
   $ python prog.py 4 -vvvv
   the square of 4 equals 16
   $ python prog.py 4
   Traceback (most recent call last):
     File "prog.py", line 11, in <module>
       if args.verbosity >= 2:
   TypeError: unorderable types: NoneType() >= int()

* The first output went well, and fixes the bug we had before.
  That is, we want any value >= 2 to be as verbose as possible.

* The third output is not so good.

Let's fix that bug::

   import argparse
   parser = argparse.ArgumentParser()
   parser.add_argument("square", type=int,
                       help="display a square of a given number")
   parser.add_argument("-v", "--verbosity", action="count", default=0,
                       help="increase output verbosity")
   args = parser.parse_args()
   answer = args.square**2
   if args.verbosity >= 2:
       print "the square of {} equals {}".format(args.square, answer)
   elif args.verbosity >= 1:
       print "{}^2 == {}".format(args.square, answer)
   else:
       print answer

We've just introduced yet another keyword, ``default``.
We've set it to ``0`` in order to make it comparable to the other int values.
Remember that by default,
if an optional argument isn't specified,
it gets the ``None`` value, and that cannot be compared to an int value
(hence the :exc:`TypeError` exception).
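
To check all three branches at once, the fixed script can be wrapped in a
function that returns the message instead of printing it. This is a sketch,
not a replacement for ``prog.py``; the function name ``describe`` is an
assumption made for this illustration.

```python
import argparse


def describe(argv):
    # Hypothetical wrapper around the fixed script: same parser, but the
    # message is returned so that each branch can be inspected.
    parser = argparse.ArgumentParser(prog="prog.py")
    parser.add_argument("square", type=int,
                        help="display a square of a given number")
    parser.add_argument("-v", "--verbosity", action="count", default=0,
                        help="increase output verbosity")
    args = parser.parse_args(argv)
    answer = args.square ** 2
    if args.verbosity >= 2:
        return "the square of {} equals {}".format(args.square, answer)
    elif args.verbosity >= 1:
        return "{}^2 == {}".format(args.square, answer)
    return str(answer)


plain = describe(["4"])            # verbosity defaults to 0, no TypeError
terse = describe(["4", "-v"])
chatty = describe(["4", "-vvv"])   # any count >= 2 is fully verbose
```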

And:

.. code-block:: sh

   $ python prog.py 4
   16

You can go quite far just with what we've learned so far,
and we have only scratched the surface.
The :mod:`argparse` module is very powerful,
and we'll explore a bit more of it before we end this tutorial.


Getting a little more advanced
==============================

What if we wanted to expand our tiny program to perform other powers,
not just squares::

   import argparse
   parser = argparse.ArgumentParser()
   parser.add_argument("x", type=int, help="the base")
   parser.add_argument("y", type=int, help="the exponent")
   parser.add_argument("-v", "--verbosity", action="count", default=0)
   args = parser.parse_args()
   answer = args.x**args.y
   if args.verbosity >= 2:
       print "{} to the power {} equals {}".format(args.x, args.y, answer)
   elif args.verbosity >= 1:
       print "{}^{} == {}".format(args.x, args.y, answer)
   else:
       print answer

Output:

.. code-block:: sh

   $ python prog.py
   usage: prog.py [-h] [-v] x y
   prog.py: error: the following arguments are required: x, y
   $ python prog.py -h
   usage: prog.py [-h] [-v] x y

   positional arguments:
     x                the base
     y                the exponent

   optional arguments:
     -h, --help       show this help message and exit
     -v, --verbosity
   $ python prog.py 4 2 -v
   4^2 == 16


Notice that so far we've been using verbosity level to *change* the text
that gets displayed. The following example uses verbosity level
to display *more* text instead::

   import argparse
   parser = argparse.ArgumentParser()
   parser.add_argument("x", type=int, help="the base")
   parser.add_argument("y", type=int, help="the exponent")
   parser.add_argument("-v", "--verbosity", action="count", default=0)
   args = parser.parse_args()
   answer = args.x**args.y
   if args.verbosity >= 2:
       print "Running '{}'".format(__file__)
   if args.verbosity >= 1:
       print "{}^{} ==".format(args.x, args.y),
   print answer

Output:

.. code-block:: sh

   $ python prog.py 4 2
   16
   $ python prog.py 4 2 -v
   4^2 == 16
   $ python prog.py 4 2 -vv
   Running 'prog.py'
   4^2 == 16


Conflicting options
-------------------

So far, we have been working with two methods of an
:class:`argparse.ArgumentParser` instance. Let's introduce a third one,
:meth:`add_mutually_exclusive_group`. It allows us to specify options that
conflict with each other. Let's also change the rest of the program so that
the new functionality makes more sense:
we'll introduce the ``--quiet`` option,
which will be the opposite of the ``--verbose`` one::

   import argparse

   parser = argparse.ArgumentParser()
   group = parser.add_mutually_exclusive_group()
   group.add_argument("-v", "--verbose", action="store_true")
   group.add_argument("-q", "--quiet", action="store_true")
   parser.add_argument("x", type=int, help="the base")
   parser.add_argument("y", type=int, help="the exponent")
   args = parser.parse_args()
   answer = args.x**args.y

   if args.quiet:
       print answer
   elif args.verbose:
       print "{} to the power {} equals {}".format(args.x, args.y, answer)
   else:
       print "{}^{} == {}".format(args.x, args.y, answer)

Our program is now simpler, and we've lost some functionality for the sake of
demonstration. Anyway, here's the output:

.. code-block:: sh

   $ python prog.py 4 2
   4^2 == 16
   $ python prog.py 4 2 -q
   16
   $ python prog.py 4 2 -v
   4 to the power 2 equals 16
   $ python prog.py 4 2 -vq
   usage: prog.py [-h] [-v | -q] x y
   prog.py: error: argument -q/--quiet: not allowed with argument -v/--verbose
   $ python prog.py 4 2 -v --quiet
   usage: prog.py [-h] [-v | -q] x y
   prog.py: error: argument -q/--quiet: not allowed with argument -v/--verbose

That should be easy to follow. I've added that last output so you can see the
sort of flexibility you get, i.e. mixing long form options with short form
ones.
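
The conflict detection can also be exercised from Python. In the sketch below
the helper names ``build_parser`` and ``conflicts`` are invented for this
illustration; ``conflicts`` simply reports whether :mod:`argparse` refused a
given argument list.

```python
import argparse


def build_parser():
    # Same parser as in the example above, with prog= fixed for illustration.
    parser = argparse.ArgumentParser(prog="prog.py")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-v", "--verbose", action="store_true")
    group.add_argument("-q", "--quiet", action="store_true")
    parser.add_argument("x", type=int, help="the base")
    parser.add_argument("y", type=int, help="the exponent")
    return parser


def conflicts(argv):
    # True when argparse rejects the argument list with an error exit.
    try:
        build_parser().parse_args(argv)
    except SystemExit:
        return True
    return False


quiet_only = conflicts(["4", "2", "-q"])
both_short = conflicts(["4", "2", "-vq"])
mixed_forms = conflicts(["4", "2", "-v", "--quiet"])
```

Only the first call is accepted; combining the two flags is refused whether
they are written in short form, long form, or a mix of the two.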

Before we conclude, you probably want to tell your users the main purpose of
your program, just in case they don't know::

   import argparse

   parser = argparse.ArgumentParser(description="calculate X to the power of Y")
   group = parser.add_mutually_exclusive_group()
   group.add_argument("-v", "--verbose", action="store_true")
   group.add_argument("-q", "--quiet", action="store_true")
   parser.add_argument("x", type=int, help="the base")
   parser.add_argument("y", type=int, help="the exponent")
   args = parser.parse_args()
   answer = args.x**args.y

   if args.quiet:
       print answer
   elif args.verbose:
       print "{} to the power {} equals {}".format(args.x, args.y, answer)
   else:
       print "{}^{} == {}".format(args.x, args.y, answer)

Note the slight difference in the usage text. The ``[-v | -q]``
tells us that we can use either ``-v`` or ``-q``,
but not both at the same time:

.. code-block:: sh

   $ python prog.py --help
   usage: prog.py [-h] [-v | -q] x y

   calculate X to the power of Y

   positional arguments:
     x              the base
     y              the exponent

   optional arguments:
     -h, --help     show this help message and exit
     -v, --verbose
     -q, --quiet


Conclusion
==========

The :mod:`argparse` module offers a lot more than shown here.
Its docs are quite detailed and thorough, and full of examples.
Having gone through this tutorial, you should easily digest them
without feeling overwhelmed.
257
Doc/howto/cporting.rst
Normal file
@@ -0,0 +1,257 @@
.. highlightlang:: c

.. _cporting-howto:

*************************************
Porting Extension Modules to Python 3
*************************************

:author: Benjamin Peterson


.. topic:: Abstract

   Although changing the C-API was not one of Python 3's objectives,
   the many Python-level changes made leaving Python 2's API intact
   impossible. In fact, some changes such as :func:`int` and
   :func:`long` unification are more obvious on the C level. This
   document endeavors to document incompatibilities and how they can
   be worked around.


Conditional compilation
=======================

The easiest way to compile only some code for Python 3 is to check
if :c:macro:`PY_MAJOR_VERSION` is greater than or equal to 3. ::

   #if PY_MAJOR_VERSION >= 3
   #define IS_PY3K
   #endif

API functions that are not present can be aliased to their equivalents within
conditional blocks.


Changes to Object APIs
======================

Python 3 merged together some types with similar functions while cleanly
separating others.


str/unicode Unification
-----------------------

Python 3's :func:`str` type is equivalent to Python 2's :func:`unicode`; the C
functions are called ``PyUnicode_*`` for both. The old 8-bit string type has
become :func:`bytes`, with C functions called ``PyBytes_*``. Python 2.6 and
later provide a compatibility header,
:file:`bytesobject.h`, mapping ``PyBytes`` names to ``PyString`` ones. For best
compatibility with Python 3, :c:type:`PyUnicode` should be used for textual
data and :c:type:`PyBytes` for binary data. It's also important to remember
that :c:type:`PyBytes` and :c:type:`PyUnicode` in Python 3 are not
interchangeable like :c:type:`PyString` and :c:type:`PyUnicode` are in
Python 2. The following example shows best practices with regard to
:c:type:`PyUnicode`, :c:type:`PyString`, and :c:type:`PyBytes`. ::

   #include "stdlib.h"
   #include "Python.h"
   #include "bytesobject.h"

   /* text example */
   static PyObject *
   say_hello(PyObject *self, PyObject *args) {
       PyObject *name, *result;

       if (!PyArg_ParseTuple(args, "U:say_hello", &name))
           return NULL;

       result = PyUnicode_FromFormat("Hello, %S!", name);
       return result;
   }

   /* just a forward */
   static char * do_encode(PyObject *);

   /* bytes example */
   static PyObject *
   encode_object(PyObject *self, PyObject *args) {
       char *encoded;
       PyObject *result, *myobj;

       if (!PyArg_ParseTuple(args, "O:encode_object", &myobj))
           return NULL;

       encoded = do_encode(myobj);
       if (encoded == NULL)
           return NULL;
       result = PyBytes_FromString(encoded);
       free(encoded);
       return result;
   }


long/int Unification
--------------------

Python 3 has only one integer type, :func:`int`. But it actually
corresponds to Python 2's :func:`long` type; the :func:`int` type
used in Python 2 was removed. In the C-API, ``PyInt_*`` functions
are replaced by their ``PyLong_*`` equivalents.


Module initialization and state
===============================

Python 3 has a revamped extension module initialization system. (See
:pep:`3121`.) Instead of storing module state in globals, it should
be stored in an interpreter-specific structure. Creating modules that
act correctly in both Python 2 and Python 3 is tricky. The following
simple example demonstrates how. ::

   #include "Python.h"

   struct module_state {
       PyObject *error;
   };

   #if PY_MAJOR_VERSION >= 3
   #define GETSTATE(m) ((struct module_state*)PyModule_GetState(m))
   #else
   #define GETSTATE(m) (&_state)
   static struct module_state _state;
   #endif

   static PyObject *
   error_out(PyObject *m) {
       struct module_state *st = GETSTATE(m);
       PyErr_SetString(st->error, "something bad happened");
       return NULL;
   }

   static PyMethodDef myextension_methods[] = {
       {"error_out", (PyCFunction)error_out, METH_NOARGS, NULL},
       {NULL, NULL}
   };

   #if PY_MAJOR_VERSION >= 3

   static int myextension_traverse(PyObject *m, visitproc visit, void *arg) {
       Py_VISIT(GETSTATE(m)->error);
       return 0;
   }

   static int myextension_clear(PyObject *m) {
       Py_CLEAR(GETSTATE(m)->error);
       return 0;
   }


   static struct PyModuleDef moduledef = {
           PyModuleDef_HEAD_INIT,
           "myextension",
           NULL,
           sizeof(struct module_state),
           myextension_methods,
           NULL,
           myextension_traverse,
           myextension_clear,
           NULL
   };

   #define INITERROR return NULL

   PyMODINIT_FUNC
   PyInit_myextension(void)

   #else
   #define INITERROR return

   void
   initmyextension(void)
   #endif
   {
   #if PY_MAJOR_VERSION >= 3
       PyObject *module = PyModule_Create(&moduledef);
   #else
       PyObject *module = Py_InitModule("myextension", myextension_methods);
   #endif

       if (module == NULL)
           INITERROR;
       struct module_state *st = GETSTATE(module);

       st->error = PyErr_NewException("myextension.Error", NULL, NULL);
       if (st->error == NULL) {
           Py_DECREF(module);
           INITERROR;
       }

   #if PY_MAJOR_VERSION >= 3
       return module;
   #endif
   }
|
||||
|
||||
|
||||
CObject replaced with Capsule
|
||||
=============================
|
||||
|
||||
The :c:type:`Capsule` object was introduced in Python 3.1 and 2.7 to replace
|
||||
:c:type:`CObject`. CObjects were useful,
|
||||
but the :c:type:`CObject` API was problematic: it didn't permit distinguishing
|
||||
between valid CObjects, which allowed mismatched CObjects to crash the
|
||||
interpreter, and some of its APIs relied on undefined behavior in C.
|
||||
(For further reading on the rationale behind Capsules, please see :issue:`5630`.)
|
||||
|
||||
If you're currently using CObjects, and you want to migrate to 3.1 or newer,
|
||||
you'll need to switch to Capsules.
|
||||
:c:type:`CObject` was deprecated in 3.1 and 2.7 and completely removed in
|
||||
Python 3.2. If you only support 2.7, or 3.1 and above, you
|
||||
can simply switch to :c:type:`Capsule`. If you need to support Python 3.0,
|
||||
or versions of Python earlier than 2.7,
|
||||
you'll have to support both CObjects and Capsules.
|
||||
(Note that Python 3.0 is no longer supported, and it is not recommended
|
||||
for production use.)
|
||||
|
||||
The following example header file :file:`capsulethunk.h` may
|
||||
solve the problem for you. Simply write your code against the
|
||||
:c:type:`Capsule` API and include this header file after
|
||||
:file:`Python.h`. Your code will automatically use Capsules
|
||||
in versions of Python with Capsules, and switch to CObjects
|
||||
when Capsules are unavailable.
|
||||
|
||||
:file:`capsulethunk.h` simulates Capsules using CObjects. However,
|
||||
:c:type:`CObject` provides no place to store the capsule's "name". As a
|
||||
result the simulated :c:type:`Capsule` objects created by :file:`capsulethunk.h`
|
||||
behave slightly differently from real Capsules. Specifically:
|
||||
|
||||
* The name parameter passed in to :c:func:`PyCapsule_New` is ignored.
|
||||
|
||||
* The name parameter passed in to :c:func:`PyCapsule_IsValid` and
|
||||
:c:func:`PyCapsule_GetPointer` is ignored, and no error checking
|
||||
of the name is performed.
|
||||
|
||||
* :c:func:`PyCapsule_GetName` always returns NULL.
|
||||
|
||||
* :c:func:`PyCapsule_SetName` always raises an exception and
|
||||
returns failure. (Since there's no way to store a name
|
||||
in a CObject, noisy failure of :c:func:`PyCapsule_SetName`
|
||||
was deemed preferable to silent failure here. If this is
|
||||
inconvenient, feel free to modify your local
|
||||
copy as you see fit.)
|
||||
|
||||
You can find :file:`capsulethunk.h` in the Python source distribution
|
||||
as :source:`Doc/includes/capsulethunk.h`. We also include it here for
|
||||
your convenience:
|
||||
|
||||
.. literalinclude:: ../includes/capsulethunk.h
|
||||
|
||||
|
||||
|
||||
Other options
|
||||
=============
|
||||
|
||||
If you are writing a new extension module, you might consider `Cython
|
||||
<http://cython.org/>`_. It translates a Python-like language to C. The
|
||||
extension modules it creates are compatible with Python 3 and Python 2.
|
||||
|
||||
439
Doc/howto/curses.rst
Normal file
@@ -0,0 +1,439 @@
|
||||
.. _curses-howto:
|
||||
|
||||
**********************************
|
||||
Curses Programming with Python
|
||||
**********************************
|
||||
|
||||
:Author: A.M. Kuchling, Eric S. Raymond
|
||||
:Release: 2.03
|
||||
|
||||
|
||||
.. topic:: Abstract
|
||||
|
||||
This document describes how to write text-mode programs with Python 2.x, using
|
||||
the :mod:`curses` extension module to control the display.
|
||||
|
||||
|
||||
What is curses?
|
||||
===============
|
||||
|
||||
The curses library supplies a terminal-independent screen-painting and
|
||||
keyboard-handling facility for text-based terminals; such terminals include
|
||||
VT100s, the Linux console, and the simulated terminal provided by X11 programs
|
||||
such as xterm and rxvt. Display terminals support various control codes to
|
||||
perform common operations such as moving the cursor, scrolling the screen, and
|
||||
erasing areas. Different terminals use widely differing codes, and often have
|
||||
their own minor quirks.
|
||||
|
||||
In a world of X displays, one might ask "why bother"? It's true that
|
||||
character-cell display terminals are an obsolete technology, but there are
|
||||
niches in which being able to do fancy things with them is still valuable. One
|
||||
is on small-footprint or embedded Unixes that don't carry an X server. Another
|
||||
is for tools like OS installers and kernel configurators that may have to run
|
||||
before X is available.
|
||||
|
||||
The curses library hides all the details of different terminals, and provides
|
||||
the programmer with an abstraction of a display, containing multiple
|
||||
non-overlapping windows. The contents of a window can be changed in various
|
||||
ways---adding text, erasing it, changing its appearance---and the curses library
|
||||
will automagically figure out what control codes need to be sent to the terminal
|
||||
to produce the right output.
|
||||
|
||||
The curses library was originally written for BSD Unix; the later System V
|
||||
versions of Unix from AT&T added many enhancements and new functions. BSD curses
|
||||
is no longer maintained, having been replaced by ncurses, which is an
|
||||
open-source implementation of the AT&T interface. If you're using an
|
||||
open-source Unix such as Linux or FreeBSD, your system almost certainly uses
|
||||
ncurses. Since most current commercial Unix versions are based on System V
|
||||
code, all the functions described here will probably be available. The older
|
||||
versions of curses carried by some proprietary Unixes may not support
|
||||
everything, though.
|
||||
|
||||
No one has made a Windows port of the curses module. On a Windows platform, try
|
||||
the Console module written by Fredrik Lundh. The Console module provides
|
||||
cursor-addressable text output, plus full support for mouse and keyboard input,
|
||||
and is available from http://effbot.org/zone/console-index.htm.
|
||||
|
||||
|
||||
The Python curses module
|
||||
------------------------
|
||||
|
||||
The Python module is a fairly simple wrapper over the C functions provided by
|
||||
curses; if you're already familiar with curses programming in C, it's really
|
||||
easy to transfer that knowledge to Python. The biggest difference is that the
|
||||
Python interface makes things simpler, by merging different C functions such as
|
||||
:func:`addstr`, :func:`mvaddstr`, and :func:`mvwaddstr` into a single
|
||||
:meth:`addstr` method. You'll see this covered in more detail later.
|
||||
|
||||
This HOWTO is simply an introduction to writing text-mode programs with curses
|
||||
and Python. It doesn't attempt to be a complete guide to the curses API; for
|
||||
that, see the Python library guide's section on ncurses, and the C manual pages
|
||||
for ncurses. It will, however, give you the basic ideas.
|
||||
|
||||
|
||||
Starting and ending a curses application
|
||||
========================================
|
||||
|
||||
Before doing anything, curses must be initialized. This is done by calling the
|
||||
:func:`initscr` function, which will determine the terminal type, send any
|
||||
required setup codes to the terminal, and create various internal data
|
||||
structures. If successful, :func:`initscr` returns a window object representing
|
||||
the entire screen; this is usually called ``stdscr``, after the name of the
|
||||
corresponding C variable. ::
|
||||
|
||||
import curses
|
||||
stdscr = curses.initscr()
|
||||
|
||||
Usually curses applications turn off automatic echoing of keys to the screen, in
|
||||
order to be able to read keys and only display them under certain circumstances.
|
||||
This requires calling the :func:`noecho` function. ::
|
||||
|
||||
curses.noecho()
|
||||
|
||||
Applications will also commonly need to react to keys instantly, without
|
||||
requiring the Enter key to be pressed; this is called cbreak mode, as opposed to
|
||||
the usual buffered input mode. ::
|
||||
|
||||
curses.cbreak()
|
||||
|
||||
Terminals usually return special keys, such as the cursor keys or navigation
|
||||
keys such as Page Up and Home, as a multibyte escape sequence. While you could
|
||||
write your application to expect such sequences and process them accordingly,
|
||||
curses can do it for you, returning a special value such as
|
||||
:const:`curses.KEY_LEFT`. To get curses to do the job, you'll have to enable
|
||||
keypad mode. ::
|
||||
|
||||
stdscr.keypad(1)
|
||||
|
||||
Terminating a curses application is much easier than starting one. You'll need
|
||||
to call ::
|
||||
|
||||
curses.nocbreak(); stdscr.keypad(0); curses.echo()
|
||||
|
||||
to reverse the curses-friendly terminal settings. Then call the :func:`endwin`
|
||||
function to restore the terminal to its original operating mode. ::
|
||||
|
||||
curses.endwin()
|
||||
|
||||
A common problem when debugging a curses application is to get your terminal
|
||||
messed up when the application dies without restoring the terminal to its
|
||||
previous state. In Python this commonly happens when your code is buggy and
|
||||
raises an uncaught exception. Keys are no longer echoed to the screen when
|
||||
you type them, for example, which makes using the shell difficult.
|
||||
|
||||
In Python you can avoid these complications and make debugging much easier by
|
||||
importing the :func:`curses.wrapper` function. It takes a callable and does
|
||||
the initializations described above, also initializing colors if color support
|
||||
is present. It then runs your provided callable and finally deinitializes
|
||||
appropriately. The callable is called inside a try-except clause which catches
|
||||
exceptions, performs curses deinitialization, and then passes the exception
|
||||
upwards. Thus, your terminal won't be left in a funny state on exception.
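
To make the idea concrete, here is a simplified sketch of the same pattern
written by hand. It is not the library's implementation (the real wrapper also
initializes colors, for instance), and ``run_with_cleanup`` is a hypothetical
name chosen for this example.

```python
import curses


def run_with_cleanup(func):
    """Call func(stdscr) and always restore the terminal afterwards."""
    stdscr = curses.initscr()
    try:
        curses.noecho()      # do not echo typed keys
        curses.cbreak()      # deliver keys without waiting for Enter
        stdscr.keypad(1)     # translate escape sequences to KEY_* values
        return func(stdscr)
    finally:
        # Undo the settings, even if func raised an exception.
        stdscr.keypad(0)
        curses.echo()
        curses.nocbreak()
        curses.endwin()
```

Because the cleanup lives in the ``finally`` clause, an exception raised by
``func`` still propagates to the caller, but only after the terminal has been
reset. In real programs, prefer :func:`curses.wrapper` itself.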
|
||||
|
||||
|
||||
Windows and Pads
|
||||
================
|
||||
|
||||
Windows are the basic abstraction in curses. A window object represents a
|
||||
rectangular area of the screen, and supports various methods to display text,
|
||||
erase it, allow the user to input strings, and so forth.
|
||||
|
||||
The ``stdscr`` object returned by the :func:`initscr` function is a window
|
||||
object that covers the entire screen. Many programs may need only this single
|
||||
window, but you might wish to divide the screen into smaller windows, in order
|
||||
to redraw or clear them separately. The :func:`newwin` function creates a new
|
||||
window of a given size, returning the new window object. ::
|
||||
|
||||
begin_x = 20; begin_y = 7
|
||||
height = 5; width = 40
|
||||
win = curses.newwin(height, width, begin_y, begin_x)
|
||||
|
||||
A word about the coordinate system used in curses: coordinates are always passed
|
||||
in the order *y,x*, and the top-left corner of a window is coordinate (0,0).
|
||||
This breaks a common convention for handling coordinates, where the *x*
|
||||
coordinate usually comes first. This is an unfortunate difference from most
|
||||
other computer applications, but it's been part of curses since it was first
|
||||
written, and it's too late to change things now.
|
||||
|
||||
When you call a method to display or erase text, the effect doesn't immediately
|
||||
show up on the display. This is because curses was originally written with slow
|
||||
300-baud terminal connections in mind; with these terminals, minimizing the time
|
||||
required to redraw the screen is very important. This lets curses accumulate
|
||||
changes to the screen, and display them in the most efficient manner. For
|
||||
example, if your program displays some characters in a window, and then clears
|
||||
the window, there's no need to send the original characters because they'd never
|
||||
be visible.
|
||||
|
||||
Accordingly, curses requires that you explicitly tell it to redraw windows,
|
||||
using the :func:`refresh` method of window objects. In practice, this doesn't
|
||||
really complicate programming with curses much. Most programs go into a flurry
|
||||
of activity, and then pause waiting for a keypress or some other action on the
|
||||
part of the user. All you have to do is to be sure that the screen has been
|
||||
redrawn before pausing to wait for user input, by simply calling
|
||||
``stdscr.refresh()`` or the :func:`refresh` method of some other relevant
|
||||
window.
|
||||
|
||||
A pad is a special case of a window; it can be larger than the actual display
|
||||
screen, and only a portion of it displayed at a time. Creating a pad simply
|
||||
requires the pad's height and width, while refreshing a pad requires giving the
|
||||
coordinates of the on-screen area where a subsection of the pad will be
|
||||
displayed. ::
|
||||
|
||||
   pad = curses.newpad(100, 100)
   # These loops fill the pad with letters; this is
   # explained in the next section
   for y in range(0, 100):
       for x in range(0, 100):
           try:
               pad.addch(y, x, ord('a') + (x*x + y*y) % 26)
           except curses.error:
               pass

   # Displays a section of the pad in the middle of the screen
   pad.refresh(0, 0, 5, 5, 20, 75)
|
||||
|
||||
The :func:`refresh` call displays a section of the pad in the rectangle
|
||||
extending from coordinate (5,5) to coordinate (20,75) on the screen; the upper
|
||||
left corner of the displayed section is coordinate (0,0) on the pad. Beyond
|
||||
that difference, pads are exactly like ordinary windows and support the same
|
||||
methods.
|
||||
|
||||
If you have multiple windows and pads on screen there is a more efficient way to
|
||||
go, which will prevent annoying screen flicker at refresh time. Use the
|
||||
:meth:`noutrefresh` method of each window to update the data structure
|
||||
representing the desired state of the screen; then change the physical screen to
|
||||
match the desired state in one go with the function :func:`doupdate`. The
|
||||
normal :meth:`refresh` method calls :func:`doupdate` as its last act.
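
A minimal sketch of that two-step update follows. The helper name
``refresh_all`` is hypothetical, and the sketch assumes ordinary windows; pads
need their display coordinates passed to :meth:`noutrefresh`, which is omitted
here.

```python
import curses


def refresh_all(windows):
    """Redraw several windows with a single physical screen update."""
    for win in windows:
        win.noutrefresh()   # update the virtual screen only
    curses.doupdate()       # push the combined result to the terminal
```

You would call ``refresh_all([stdscr, win])`` once per pass through your main
loop, after all drawing for that pass is finished.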
|
||||
|
||||
|
||||
Displaying Text
|
||||
===============
|
||||
|
||||
From a C programmer's point of view, curses may sometimes look like a twisty
|
||||
maze of functions, all subtly different. For example, :func:`addstr` displays a
|
||||
string at the current cursor location in the ``stdscr`` window, while
|
||||
:func:`mvaddstr` moves to a given y,x coordinate first before displaying the
|
||||
string. :func:`waddstr` is just like :func:`addstr`, but allows specifying a
|
||||
window to use, instead of using ``stdscr`` by default. :func:`mvwaddstr` follows
|
||||
similarly.
|
||||
|
||||
Fortunately the Python interface hides all these details; ``stdscr`` is a window
|
||||
object like any other, and methods like :func:`addstr` accept multiple argument
|
||||
forms. Usually there are four different forms.
|
||||
|
||||
+---------------------------------+-----------------------------------------------+
|
||||
| Form | Description |
|
||||
+=================================+===============================================+
|
||||
| *str* or *ch* | Display the string *str* or character *ch* at |
|
||||
| | the current position |
|
||||
+---------------------------------+-----------------------------------------------+
|
||||
| *str* or *ch*, *attr* | Display the string *str* or character *ch*, |
|
||||
| | using attribute *attr* at the current |
|
||||
| | position |
|
||||
+---------------------------------+-----------------------------------------------+
|
||||
| *y*, *x*, *str* or *ch* | Move to position *y,x* within the window, and |
|
||||
| | display *str* or *ch* |
|
||||
+---------------------------------+-----------------------------------------------+
|
||||
| *y*, *x*, *str* or *ch*, *attr* | Move to position *y,x* within the window, and |
|
||||
| | display *str* or *ch*, using attribute *attr* |
|
||||
+---------------------------------+-----------------------------------------------+
|
||||
|
||||
Attributes allow displaying text in highlighted forms, such as in boldface,
|
||||
underline, reverse video, or in color. They'll be explained in more detail in
|
||||
the next subsection.
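
The four forms from the table can be sketched as follows. The function name
``draw_examples`` and the strings are made up for illustration; ``win`` is
assumed to be any window object, such as ``stdscr``.

```python
import curses


def draw_examples(win):
    """Exercise the four argument forms of addstr() on a window."""
    win.addstr("plain text at the cursor")
    win.addstr(" then bold text", curses.A_BOLD)
    win.addstr(2, 0, "text placed at row 2, column 0")
    win.addstr(3, 0, "reversed text at row 3", curses.A_REVERSE)
    win.refresh()
```

Passing this function to :func:`curses.wrapper` is one way to try it out,
although the screen is restored as soon as the function returns unless you
also wait for a key.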
|
||||
|
||||
The :func:`addstr` function takes a Python string as the value to be displayed,
|
||||
while the :func:`addch` functions take a character, which can be either a Python
|
||||
string of length 1 or an integer. If it's a string, you're limited to
|
||||
displaying characters between 0 and 255. SVr4 curses provides constants for
|
||||
extension characters; these constants are integers greater than 255. For
|
||||
example, :const:`ACS_PLMINUS` is a +/- symbol, and :const:`ACS_ULCORNER` is the
|
||||
upper left corner of a box (handy for drawing borders).
|
||||
|
||||
Windows remember where the cursor was left after the last operation, so if you
|
||||
leave out the *y,x* coordinates, the string or character will be displayed
|
||||
wherever the last operation left off. You can also move the cursor with the
|
||||
``move(y,x)`` method. Because some terminals always display a flashing cursor,
|
||||
you may want to ensure that the cursor is positioned in some location where it
|
||||
won't be distracting; it can be confusing to have the cursor blinking at some
|
||||
apparently random location.
|
||||
|
||||
If your application doesn't need a blinking cursor at all, you can call
|
||||
``curs_set(0)`` to make it invisible. Equivalently, and for compatibility with
|
||||
older curses versions, there's a ``leaveok(bool)`` function. When *bool* is
|
||||
true, the curses library will attempt to suppress the flashing cursor, and you
|
||||
won't need to worry about leaving it in odd locations.
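
Not every terminal can hide its cursor, so it is worth guarding the call. The
sketch below assumes that curses has already been initialized and that an
unsupported request surfaces as :exc:`curses.error`; ``hide_cursor`` is a
hypothetical helper name.

```python
import curses


def hide_cursor():
    """Try to hide the cursor; return True if the terminal allowed it."""
    try:
        curses.curs_set(0)
    except curses.error:
        # Assumption: terminals lacking this capability raise curses.error.
        return False
    return True
```

If the helper returns false, fall back to parking the cursor somewhere
unobtrusive with ``move(y,x)``.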
|
||||
|
||||
|
||||
Attributes and Color
|
||||
--------------------
|
||||
|
||||
Characters can be displayed in different ways. Status lines in a text-based
|
||||
application are commonly shown in reverse video; a text viewer may need to
|
||||
highlight certain words. curses supports this by allowing you to specify an
|
||||
attribute for each cell on the screen.
|
||||
|
||||
An attribute is an integer, each bit representing a different attribute. You can
|
||||
try to display text with multiple attribute bits set, but curses doesn't
|
||||
guarantee that all the possible combinations are available, or that they're all
|
||||
visually distinct. That depends on the ability of the terminal being used, so
|
||||
it's safest to stick to the most commonly available attributes, listed here.
|
||||
|
||||
+----------------------+--------------------------------------+
|
||||
| Attribute | Description |
|
||||
+======================+======================================+
|
||||
| :const:`A_BLINK` | Blinking text |
|
||||
+----------------------+--------------------------------------+
|
||||
| :const:`A_BOLD` | Extra bright or bold text |
|
||||
+----------------------+--------------------------------------+
|
||||
| :const:`A_DIM` | Half bright text |
|
||||
+----------------------+--------------------------------------+
|
||||
| :const:`A_REVERSE` | Reverse-video text |
|
||||
+----------------------+--------------------------------------+
|
||||
| :const:`A_STANDOUT` | The best highlighting mode available |
|
||||
+----------------------+--------------------------------------+
|
||||
| :const:`A_UNDERLINE` | Underlined text |
|
||||
+----------------------+--------------------------------------+
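
Because each attribute occupies its own bit, several attributes can be combined
with bitwise OR and inspected with bitwise AND. The short sketch below only
manipulates the integer values, so it does not need an initialized screen; the
variable names are illustrative.

```python
import curses

# Combine two attribute bits into one value.
emphasis = curses.A_BOLD | curses.A_UNDERLINE

# Each bit can be tested individually with bitwise AND.
has_bold = bool(emphasis & curses.A_BOLD)
has_blink = bool(emphasis & curses.A_BLINK)
print(has_bold, has_blink)
```

The combined value is passed wherever an *attr* argument is accepted, for
example ``stdscr.addstr(0, 0, "text", emphasis)``. Whether the terminal renders
the combination distinctly is a separate question.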
|
||||
|
||||
So, to display a reverse-video status line on the top line of the screen, you
|
||||
could code::
|
||||
|
||||
stdscr.addstr(0, 0, "Current mode: Typing mode",
|
||||
curses.A_REVERSE)
|
||||
stdscr.refresh()
|
||||
|
||||
The curses library also supports color on those terminals that provide it. The
|
||||
most common such terminal is probably the Linux console, followed by color
|
||||
xterms.
|
||||
|
||||
To use color, you must call the :func:`start_color` function soon after calling
|
||||
:func:`initscr`, to initialize the default color set (the
|
||||
:func:`curses.wrapper` function does this automatically). Once that's
|
||||
done, the :func:`has_colors` function returns TRUE if the terminal in use can
|
||||
actually display color. (Note: curses uses the American spelling 'color',
|
||||
instead of the Canadian/British spelling 'colour'. If you're used to the
|
||||
British spelling, you'll have to resign yourself to misspelling it for the sake
|
||||
of these functions.)
|
||||
|
||||
The curses library maintains a finite number of color pairs, containing a
|
||||
foreground (or text) color and a background color. You can get the attribute
|
||||
value corresponding to a color pair with the :func:`color_pair` function; this
|
||||
can be bitwise-OR'ed with other attributes such as :const:`A_REVERSE`, but
|
||||
again, such combinations are not guaranteed to work on all terminals.
|
||||
|
||||
An example, which displays a line of text using color pair 1::
|
||||
|
||||
stdscr.addstr("Pretty text", curses.color_pair(1))
|
||||
stdscr.refresh()
|
||||
|
||||
As I said before, a color pair consists of a foreground and background color.
|
||||
:func:`start_color` initializes 8 basic colors when it activates color mode.
|
||||
They are: 0:black, 1:red, 2:green, 3:yellow, 4:blue, 5:magenta, 6:cyan, and
|
||||
7:white. The curses module defines named constants for each of these colors:
|
||||
:const:`curses.COLOR_BLACK`, :const:`curses.COLOR_RED`, and so forth.
|
||||
|
||||
The ``init_pair(n, f, b)`` function changes the definition of color pair *n*, to
|
||||
foreground color f and background color b. Color pair 0 is hard-wired to white
|
||||
on black, and cannot be changed.
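
A defensive way to use these functions is to fall back to a plain attribute
when the terminal has no color. This is only a sketch: ``alert_attribute`` is a
hypothetical helper, and it assumes :func:`initscr` and :func:`start_color`
have already been called.

```python
import curses


def alert_attribute(pair_number=1):
    """Return an attribute for red-on-white text, or A_NORMAL without color."""
    if not curses.has_colors():
        return curses.A_NORMAL
    curses.init_pair(pair_number, curses.COLOR_RED, curses.COLOR_WHITE)
    return curses.color_pair(pair_number)
```

The returned value can be passed as the *attr* argument of :meth:`addstr`, so
the calling code does not need to know whether color is available.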
|
||||
|
||||
Let's put all this together. To change color pair 1 to red text on a white
|
||||
background, you would call::
|
||||
|
||||
curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
|
||||
|
||||
When you change a color pair, any text already displayed using that color pair
|
||||
will change to the new colors. You can also display new text in this color
|
||||
with::
|
||||
|
||||
stdscr.addstr(0,0, "RED ALERT!", curses.color_pair(1))
|
||||
|
||||
Very fancy terminals can change the definitions of the actual colors to a given
|
||||
RGB value. This lets you change color 1, which is usually red, to purple or
|
||||
blue or any other color you like. Unfortunately, the Linux console doesn't
|
||||
support this, so I'm unable to try it out, and can't provide any examples. You
|
||||
can check if your terminal can do this by calling :func:`can_change_color`,
|
||||
which returns TRUE if the capability is there. If you're lucky enough to have
|
||||
such a talented terminal, consult your system's man pages for more information.
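
For readers who do have such a terminal, the following sketch shows roughly
what the calls look like. It has not been tried on real hardware here; it
relies on :func:`curses.init_color`, whose color components range from 0 to
1000, and ``redefine_color`` is a hypothetical helper name.

```python
import curses


def redefine_color(color_number, red, green, blue):
    """Change a color's RGB definition if the terminal permits it.

    Components use the curses scale of 0 to 1000.
    """
    if not curses.can_change_color():
        return False
    curses.init_color(color_number, red, green, blue)
    return True
```

For example, ``redefine_color(curses.COLOR_RED, 500, 0, 500)`` would ask the
terminal to show color 1 as a shade of purple.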
|
||||
|
||||
|
||||
User Input
|
||||
==========
|
||||
|
||||
The curses library itself offers only very simple input mechanisms. Python's
|
||||
support adds a text-input widget that makes up some of the lack.
|
||||
|
||||
The most common way to get input to a window is to use its :meth:`getch` method.
|
||||
:meth:`getch` pauses and waits for the user to hit a key, displaying it if
|
||||
:func:`echo` has been called earlier. You can optionally specify a coordinate
|
||||
to which the cursor should be moved before pausing.
|
||||
|
||||
It's possible to change this behavior with the method :meth:`nodelay`. After
|
||||
``nodelay(1)``, :meth:`getch` for the window becomes non-blocking and returns
|
||||
``curses.ERR`` (a value of -1) when no input is ready. There's also a
|
||||
:func:`halfdelay` function, which can be used to (in effect) set a timer on each
|
||||
:meth:`getch`; if no input becomes available within a specified
|
||||
delay (measured in tenths of a second), curses raises an exception.
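
Non-blocking input can be wrapped in a small helper so that the rest of the
program deals with ``None`` instead of a sentinel value. This is a sketch;
``poll_key`` is a hypothetical name, and ``win`` is assumed to be a window
object.

```python
import curses


def poll_key(win):
    """Return a pending key code from win, or None if no key is waiting."""
    win.nodelay(1)          # make getch() non-blocking for this window
    key = win.getch()
    if key == curses.ERR:   # curses.ERR is -1
        return None
    return key
```

A program that animates the display can call ``poll_key(stdscr)`` once per
frame and carry on drawing when it returns ``None``.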
|
||||
|
||||
The :meth:`getch` method returns an integer; if it's between 0 and 255, it
|
||||
represents the ASCII code of the key pressed. Values greater than 255 are
|
||||
special keys such as Page Up, Home, or the cursor keys. You can compare the
|
||||
value returned to constants such as :const:`curses.KEY_PPAGE`,
|
||||
:const:`curses.KEY_HOME`, or :const:`curses.KEY_LEFT`. Usually the main loop of
|
||||
your program will look something like this::
|
||||
|
||||
   while True:
       c = stdscr.getch()
       if c == ord('p'):
           PrintDocument()
       elif c == ord('q'):
           break  # Exit the while loop
       elif c == curses.KEY_HOME:
           x = y = 0
|
||||
|
||||
The :mod:`curses.ascii` module supplies ASCII class membership functions that
|
||||
take either integer or 1-character-string arguments; these may be useful in
|
||||
writing more readable tests for your command interpreters. It also supplies
|
||||
conversion functions that take either integer or 1-character-string arguments
|
||||
and return the same type. For example, :func:`curses.ascii.ctrl` returns the
|
||||
control character corresponding to its argument.
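
A few calls illustrate both families of functions. Unlike most of the curses
API, these work without initializing the screen; the variable names are only
for illustration.

```python
import curses.ascii

# Class-membership tests accept a 1-character string or an integer code.
digit_from_str = curses.ascii.isdigit('7')
digit_from_int = curses.ascii.isdigit(ord('7'))

# Conversion functions return the same type they were given.
ctrl_from_str = curses.ascii.ctrl('a')       # a 1-character string
ctrl_from_int = curses.ascii.ctrl(ord('a'))  # an integer

print(digit_from_str, digit_from_int)
print(repr(ctrl_from_str), ctrl_from_int)
```

In a key-handling loop, a test such as ``c == curses.ascii.ctrl(ord('a'))``
reads more clearly than comparing against a bare number.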
|
||||
|
||||
There's also a method to retrieve an entire string, :meth:`getstr`. It isn't
|
||||
used very often, because its functionality is quite limited; the only editing
|
||||
keys available are the backspace key and the Enter key, which terminates the
|
||||
string. It can optionally be limited to a fixed number of characters. ::
|
||||
|
||||
curses.echo() # Enable echoing of characters
|
||||
|
||||
# Get a 15-character string, with the cursor on the top line
|
||||
s = stdscr.getstr(0,0, 15)
|
||||
|
||||
The Python :mod:`curses.textpad` module supplies something better. With it, you
|
||||
can turn a window into a text box that supports an Emacs-like set of
|
||||
keybindings. Various methods of the :class:`Textbox` class support editing with
|
||||
input validation and gathering the edit results either with or without trailing
|
||||
spaces. See the library documentation on :mod:`curses.textpad` for the
|
||||
details.
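
As a rough sketch of how the pieces fit together, the helper below creates a
small window, lets the user edit in it, and returns the text. The name
``prompt_line`` and the default geometry are assumptions made for this example,
and curses must already be initialized when it is called.

```python
import curses
from curses.textpad import Textbox


def prompt_line(height=1, width=30, begin_y=2, begin_x=1):
    """Let the user edit text in a small window and return the result."""
    editwin = curses.newwin(height, width, begin_y, begin_x)
    box = Textbox(editwin)
    box.edit()            # returns when the user presses Ctrl-G
    return box.gather()   # the window contents as a string
```

In a one-line window the Enter key also ends editing, which makes this a
reasonable replacement for :meth:`getstr` in simple prompts.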
|
||||
|
||||
|
||||
For More Information
|
||||
====================
|
||||
|
||||
This HOWTO didn't cover some advanced topics, such as screen-scraping or
|
||||
capturing mouse events from an xterm instance. But the Python library page for
|
||||
the curses modules is now pretty complete. You should browse it next.
|
||||
|
||||
If you're in doubt about the detailed behavior of any of the ncurses entry
|
||||
points, consult the manual pages for your curses implementation, whether it's
|
||||
ncurses or a proprietary Unix vendor's. The manual pages will document any
|
||||
quirks, and provide complete lists of all the functions, attributes, and
|
||||
:const:`ACS_\*` characters available to you.
|
||||
|
||||
Because the curses API is so large, some functions aren't supported in the
|
||||
Python interface, not because they're difficult to implement, but because no one
|
||||
has needed them yet. Feel free to add them and then submit a patch. Also, we
|
||||
don't yet have support for the menu library associated with
|
||||
ncurses; feel free to add that.
|
||||
|
||||
If you write an interesting little program, feel free to contribute it as
|
||||
another demo. We can always use more of them!
|
||||
|
||||
The ncurses FAQ: http://invisible-island.net/ncurses/ncurses.faq.html
|
||||
439
Doc/howto/descriptor.rst
Normal file
@@ -0,0 +1,439 @@
|
||||
======================
|
||||
Descriptor HowTo Guide
|
||||
======================
|
||||
|
||||
:Author: Raymond Hettinger
|
||||
:Contact: <python at rcn dot com>
|
||||
|
||||
.. Contents::
|
||||
|
||||
Abstract
|
||||
--------
|
||||
|
||||
Defines descriptors, summarizes the protocol, and shows how descriptors are
|
||||
called. Examines a custom descriptor and several built-in Python descriptors
|
||||
including functions, properties, static methods, and class methods. Shows how
|
||||
each works by giving a pure Python equivalent and a sample application.
|
||||
|
||||
Learning about descriptors not only provides access to a larger toolset, it
|
||||
creates a deeper understanding of how Python works and an appreciation for the
|
||||
elegance of its design.
|
||||
|
||||
|
||||
Definition and Introduction
|
||||
---------------------------
|
||||
|
||||
In general, a descriptor is an object attribute with "binding behavior", one
|
||||
whose attribute access has been overridden by methods in the descriptor
|
||||
protocol. Those methods are :meth:`__get__`, :meth:`__set__`, and
|
||||
:meth:`__delete__`. If any of those methods are defined for an object, it is
|
||||
said to be a descriptor.
|
||||
|
||||
The default behavior for attribute access is to get, set, or delete the
|
||||
attribute from an object's dictionary. For instance, ``a.x`` has a lookup chain
|
||||
starting with ``a.__dict__['x']``, then ``type(a).__dict__['x']``, and
|
||||
continuing through the base classes of ``type(a)`` excluding metaclasses. If the
|
||||
looked-up value is an object defining one of the descriptor methods, then Python
|
||||
may override the default behavior and invoke the descriptor method instead.
|
||||
Where this occurs in the precedence chain depends on which descriptor methods
|
||||
were defined. Note that descriptors are only invoked for new style objects or
|
||||
classes (a class is new style if it inherits from :class:`object` or
|
||||
:class:`type`).
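
The default lookup chain, before any descriptor gets involved, can be watched
in a few lines. The class names ``Base`` and ``A`` are made up for this sketch,
and ``x`` is ordinary data rather than a descriptor.

```python
class Base(object):
    x = "from Base"


class A(Base):
    pass


a = A()

# Nothing in the instance or in A itself, so the base class supplies x.
first = a.x

# A class-level entry now shadows the base class.
A.x = "from A"
second = a.x

# An instance-dictionary entry shadows both class-level entries.
a.__dict__['x'] = "from instance"
third = a.x

print(first, second, third)
```

Each assignment adds an entry earlier in the chain, and the lookup stops at the
first entry it finds.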
|
||||
|
||||
Descriptors are a powerful, general purpose protocol. They are the mechanism
|
||||
behind properties, methods, static methods, class methods, and :func:`super()`.
|
||||
They are used throughout Python itself to implement the new style classes
|
||||
introduced in version 2.2. Descriptors simplify the underlying C-code and offer
|
||||
a flexible set of new tools for everyday Python programs.
|
||||
|
||||
|
||||
Descriptor Protocol
|
||||
-------------------
|
||||
|
||||
``descr.__get__(self, obj, type=None) --> value``
|
||||
|
||||
``descr.__set__(self, obj, value) --> None``
|
||||
|
||||
``descr.__delete__(self, obj) --> None``
|
||||
|
||||
That is all there is to it. Define any of these methods and an object is
|
||||
considered a descriptor and can override default behavior upon being looked up
|
||||
as an attribute.
|
||||
|
||||
If an object defines both :meth:`__get__` and :meth:`__set__`, it is considered
|
||||
a data descriptor. Descriptors that only define :meth:`__get__` are called
|
||||
non-data descriptors (they are typically used for methods but other uses are
|
||||
possible).
|
||||
|
||||
Data and non-data descriptors differ in how overrides are calculated with
|
||||
respect to entries in an instance's dictionary. If an instance's dictionary
|
||||
has an entry with the same name as a data descriptor, the data descriptor
|
||||
takes precedence. If an instance's dictionary has an entry with the same
|
||||
name as a non-data descriptor, the dictionary entry takes precedence.
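
The precedence rule can be checked with a small experiment. This is a minimal sketch; the class names ``DataDesc``, ``NonDataDesc`` and ``Demo`` are hypothetical and exist only for illustration:

```python
class DataDesc(object):
    """Hypothetical data descriptor: defines both __get__ and __set__."""

    def __get__(self, obj, objtype=None):
        return 'from data descriptor'

    def __set__(self, obj, value):
        raise AttributeError('read-only')


class NonDataDesc(object):
    """Hypothetical non-data descriptor: defines only __get__."""

    def __get__(self, obj, objtype=None):
        return 'from non-data descriptor'


class Demo(object):
    d = DataDesc()
    n = NonDataDesc()


demo = Demo()
# Write straight into the instance dictionary, bypassing __set__.
demo.__dict__['d'] = 'from instance dict'
demo.__dict__['n'] = 'from instance dict'

data_result = demo.d        # the data descriptor takes precedence
non_data_result = demo.n    # the instance dictionary takes precedence
```

Both names have an entry in ``demo.__dict__``, yet only the lookup of ``n`` returns it.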
|
||||
|
||||
To make a read-only data descriptor, define both :meth:`__get__` and
|
||||
:meth:`__set__` with the :meth:`__set__` raising an :exc:`AttributeError` when
|
||||
called. Defining the :meth:`__set__` method with an exception raising
|
||||
placeholder is enough to make it a data descriptor.
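
A read-only data descriptor along these lines might look as follows. This is a sketch only; ``ReadOnly`` and ``Config`` are hypothetical names:

```python
class ReadOnly(object):
    """Hypothetical read-only data descriptor."""

    def __init__(self, value):
        self.value = value

    def __get__(self, obj, objtype=None):
        return self.value

    def __set__(self, obj, value):
        # Defining __set__ makes this a data descriptor; raising here
        # makes the attribute read-only.
        raise AttributeError("attribute is read-only")


class Config(object):
    version = ReadOnly("1.0")


cfg = Config()
try:
    cfg.version = "2.0"
    assignment_blocked = False
except AttributeError:
    assignment_blocked = True
```

The assignment never reaches the instance dictionary, so ``cfg.version`` keeps returning the descriptor's value.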
|
||||
|
||||
|
||||
Invoking Descriptors
|
||||
--------------------
|
||||
|
||||
A descriptor can be called directly by its method name. For example,
|
||||
``d.__get__(obj)``.
|
||||
|
||||
Alternatively, it is more common for a descriptor to be invoked automatically
|
||||
upon attribute access. For example, ``obj.d`` looks up ``d`` in the dictionary
|
||||
of ``obj``. If ``d`` defines the method :meth:`__get__`, then ``d.__get__(obj)``
|
||||
is invoked according to the precedence rules listed below.
|
||||
|
||||
The details of invocation depend on whether ``obj`` is an object or a class.
|
||||
Either way, descriptors only work for new style objects and classes. A class is
|
||||
new style if it is a subclass of :class:`object`.
|
||||
|
||||
For objects, the machinery is in :meth:`object.__getattribute__` which
|
||||
transforms ``b.x`` into ``type(b).__dict__['x'].__get__(b, type(b))``. The
|
||||
implementation works through a precedence chain that gives data descriptors
|
||||
priority over instance variables, instance variables priority over non-data
|
||||
descriptors, and assigns lowest priority to :meth:`__getattr__` if provided.
|
||||
The full C implementation can be found in :c:func:`PyObject_GenericGetAttr()` in
|
||||
:source:`Objects/object.c`.
|
||||
|
||||
For classes, the machinery is in :meth:`type.__getattribute__` which transforms
|
||||
``B.x`` into ``B.__dict__['x'].__get__(None, B)``. In pure Python, it looks
|
||||
like::
|
||||
|
||||
    def __getattribute__(self, key):
        "Emulate type_getattro() in Objects/typeobject.c"
        v = object.__getattribute__(self, key)
        if hasattr(v, '__get__'):
            return v.__get__(None, self)
        return v
|
||||
|
||||
The important points to remember are:
|
||||
|
||||
* descriptors are invoked by the :meth:`__getattribute__` method
|
||||
* overriding :meth:`__getattribute__` prevents automatic descriptor calls
|
||||
* :meth:`__getattribute__` is only available with new style classes and objects
|
||||
* :meth:`object.__getattribute__` and :meth:`type.__getattribute__` make
|
||||
different calls to :meth:`__get__`.
|
||||
* data descriptors always override instance dictionaries.
|
||||
* non-data descriptors may be overridden by instance dictionaries.
|
||||
|
||||
The object returned by ``super()`` also has a custom :meth:`__getattribute__`
|
||||
method for invoking descriptors. The call ``super(B, obj).m()`` searches
|
||||
``obj.__class__.__mro__`` for the base class ``A`` immediately following ``B``
|
||||
and then returns ``A.__dict__['m'].__get__(obj, B)``. If not a descriptor,
|
||||
``m`` is returned unchanged. If not in the dictionary, ``m`` reverts to a
|
||||
search using :meth:`object.__getattribute__`.
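
The equivalence described for ``super()`` can be observed directly. A minimal sketch, with ``A`` and ``B`` defined here purely for illustration:

```python
class A(object):
    def m(self):
        return 'A.m'


class B(A):
    def m(self):
        return 'B.m'


obj = B()
via_super = super(B, obj).m                  # lookup through super()
via_dict = A.__dict__['m'].__get__(obj, B)   # the explicit descriptor call
```

Both names refer to a method bound to ``obj`` that calls ``A.m``, skipping the override in ``B``.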
|
||||
|
||||
Note, in Python 2.2, ``super(B, obj).m()`` would only invoke :meth:`__get__` if
|
||||
``m`` was a data descriptor. In Python 2.3, non-data descriptors also get
|
||||
invoked unless an old-style class is involved. The implementation details are
|
||||
in :c:func:`super_getattro()` in :source:`Objects/typeobject.c`.
|
||||
|
||||
.. _`Guido's Tutorial`: https://www.python.org/download/releases/2.2.3/descrintro/#cooperation
|
||||
|
||||
The details above show that the mechanism for descriptors is embedded in the
|
||||
:meth:`__getattribute__()` methods for :class:`object`, :class:`type`, and
|
||||
:func:`super`. Classes inherit this machinery when they derive from
|
||||
:class:`object` or if they have a meta-class providing similar functionality.
|
||||
Likewise, classes can turn-off descriptor invocation by overriding
|
||||
:meth:`__getattribute__()`.
|
||||
|
||||
|
||||
Descriptor Example
|
||||
------------------
|
||||
|
||||
The following code creates a class whose objects are data descriptors which
|
||||
print a message for each get or set. Overriding :meth:`__getattribute__` is
|
||||
an alternate approach that could do this for every attribute. However, this
|
||||
descriptor is useful for monitoring just a few chosen attributes::
|
||||
|
||||
    class RevealAccess(object):
        """A data descriptor that sets and returns values
           normally and prints a message logging their access.
        """

        def __init__(self, initval=None, name='var'):
            self.val = initval
            self.name = name

        def __get__(self, obj, objtype):
            print 'Retrieving', self.name
            return self.val

        def __set__(self, obj, val):
            print 'Updating', self.name
            self.val = val

    >>> class MyClass(object):
    ...     x = RevealAccess(10, 'var "x"')
    ...     y = 5
    ...
    >>> m = MyClass()
    >>> m.x
    Retrieving var "x"
    10
    >>> m.x = 20
    Updating var "x"
    >>> m.x
    Retrieving var "x"
    20
    >>> m.y
    5
|
||||
|
||||
The protocol is simple and offers exciting possibilities. Several use cases are
|
||||
so common that they have been packaged into individual function calls.
|
||||
Properties, bound and unbound methods, static methods, and class methods are all
|
||||
based on the descriptor protocol.
|
||||
|
||||
|
||||
Properties
|
||||
----------
|
||||
|
||||
Calling :func:`property` is a succinct way of building a data descriptor that
|
||||
triggers function calls upon access to an attribute. Its signature is::
|
||||
|
||||
property(fget=None, fset=None, fdel=None, doc=None) -> property attribute
|
||||
|
||||
The documentation shows a typical use to define a managed attribute ``x``::
|
||||
|
||||
    class C(object):
        def getx(self): return self.__x
        def setx(self, value): self.__x = value
        def delx(self): del self.__x
        x = property(getx, setx, delx, "I'm the 'x' property.")
|
||||
|
||||
To see how :func:`property` is implemented in terms of the descriptor protocol,
|
||||
here is a pure Python equivalent::
|
||||
|
||||
    class Property(object):
        "Emulate PyProperty_Type() in Objects/descrobject.c"

        def __init__(self, fget=None, fset=None, fdel=None, doc=None):
            self.fget = fget
            self.fset = fset
            self.fdel = fdel
            if doc is None and fget is not None:
                doc = fget.__doc__
            self.__doc__ = doc

        def __get__(self, obj, objtype=None):
            if obj is None:
                return self
            if self.fget is None:
                raise AttributeError("unreadable attribute")
            return self.fget(obj)

        def __set__(self, obj, value):
            if self.fset is None:
                raise AttributeError("can't set attribute")
            self.fset(obj, value)

        def __delete__(self, obj):
            if self.fdel is None:
                raise AttributeError("can't delete attribute")
            self.fdel(obj)

        def getter(self, fget):
            return type(self)(fget, self.fset, self.fdel, self.__doc__)

        def setter(self, fset):
            return type(self)(self.fget, fset, self.fdel, self.__doc__)

        def deleter(self, fdel):
            return type(self)(self.fget, self.fset, fdel, self.__doc__)
|
||||
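
The ``getter``, ``setter`` and ``deleter`` methods in the emulation mirror those of the builtin :func:`property`, which is what makes decorator-style definitions possible. A minimal sketch using the builtin; the ``Temperature`` class and its attribute names are hypothetical:

```python
class Temperature(object):
    """Hypothetical class using the builtin property as a decorator."""

    def __init__(self):
        self._celsius = 0.0

    @property
    def celsius(self):
        "Temperature in degrees Celsius."
        return self._celsius

    @celsius.setter
    def celsius(self, value):
        # setter() returns a new property that keeps the existing getter.
        self._celsius = float(value)


t = Temperature()
t.celsius = 21
```

Each decorator call returns a fresh property object, so the class ends up with a single data descriptor named ``celsius``.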
|
||||
The :func:`property` builtin helps whenever a user interface has granted
|
||||
attribute access and then subsequent changes require the intervention of a
|
||||
method.
|
||||
|
||||
For instance, a spreadsheet class may grant access to a cell value through
|
||||
``Cell('b10').value``. Subsequent improvements to the program require the cell
|
||||
to be recalculated on every access; however, the programmer does not want to
|
||||
affect existing client code accessing the attribute directly. The solution is
|
||||
to wrap access to the value attribute in a property data descriptor::
|
||||
|
||||
    class Cell(object):
        . . .
        def getvalue(self):
            "Recalculate the cell before returning value"
            self.recalc()
            return self._value
        value = property(getvalue)
|
||||
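
To make the spreadsheet idea concrete, here is one way the elided parts of ``Cell`` could be filled in. Everything beyond ``getvalue`` and ``value`` is an assumption made for this sketch: the ``formula`` callable, the ``recalc_count`` counter and the ``inputs`` dictionary are hypothetical.

```python
class Cell(object):
    """Hypothetical cell whose value is computed by a formula callable."""

    def __init__(self, formula):
        self._formula = formula
        self._value = None
        self.recalc_count = 0

    def recalc(self):
        self._value = self._formula()
        self.recalc_count += 1

    def getvalue(self):
        "Recalculate the cell before returning value"
        self.recalc()
        return self._value
    value = property(getvalue)


inputs = {'a1': 2, 'a2': 3}
cell = Cell(lambda: inputs['a1'] + inputs['a2'])
first = cell.value      # triggers a recalculation
inputs['a1'] = 10
second = cell.value     # recalculated again with the new input
```

Client code keeps reading ``cell.value`` as a plain attribute while every access runs ``recalc()``.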
|
||||
|
||||
Functions and Methods
|
||||
---------------------
|
||||
|
||||
Python's object oriented features are built upon a function based environment.
|
||||
Using non-data descriptors, the two are merged seamlessly.
|
||||
|
||||
Class dictionaries store methods as functions. In a class definition, methods
|
||||
are written using :keyword:`def` and :keyword:`lambda`, the usual tools for
|
||||
creating functions. The only difference from regular functions is that the
|
||||
first argument is reserved for the object instance. By Python convention, the
|
||||
instance reference is called *self* but may be called *this* or any other
|
||||
variable name.
|
||||
|
||||
To support method calls, functions include the :meth:`__get__` method for
|
||||
binding methods during attribute access. This means that all functions are
|
||||
non-data descriptors which return bound or unbound methods depending on whether
they are invoked from an object or a class. In pure Python, it works like
|
||||
this::
|
||||
|
||||
    class Function(object):
        . . .
        def __get__(self, obj, objtype=None):
            "Simulate func_descr_get() in Objects/funcobject.c"
            return types.MethodType(self, obj, objtype)
|
||||
|
||||
Running the interpreter shows how the function descriptor works in practice::
|
||||
|
||||
    >>> class D(object):
    ...     def f(self, x):
    ...         return x
    ...
    >>> d = D()
    >>> D.__dict__['f']  # Stored internally as a function
    <function f at 0x00C45070>
    >>> D.f              # Get from a class becomes an unbound method
    <unbound method D.f>
    >>> d.f              # Get from an instance becomes a bound method
    <bound method D.f of <__main__.D object at 0x00B18C90>>
|
||||
|
||||
The output suggests that bound and unbound methods are two different types.
|
||||
While they could have been implemented that way, the actual C implementation of
|
||||
:c:type:`PyMethod_Type` in :source:`Objects/classobject.c` is a single object
|
||||
with two different representations depending on whether the :attr:`im_self`
|
||||
field is set or is *NULL* (the C equivalent of ``None``).
|
||||
|
||||
Likewise, the effects of calling a method object depend on the :attr:`im_self`
|
||||
field. If set (meaning bound), the original function (stored in the
|
||||
:attr:`im_func` field) is called as expected with the first argument set to the
|
||||
instance. If unbound, all of the arguments are passed unchanged to the original
|
||||
function. The actual C implementation of :func:`instancemethod_call()` is only
|
||||
slightly more complex in that it includes some type checking.
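
The binding step can also be performed by hand, which shows that a bound method is just the result of calling the function's :meth:`__get__`. A minimal sketch, written so that it should behave the same on Python 2.7 and Python 3:

```python
class D(object):
    def f(self, x):
        return x


d = D()
raw = D.__dict__['f']        # the plain function stored in the class dict
bound = raw.__get__(d, D)    # what the lookup d.f does behind the scenes
```

Calling ``bound(3)`` supplies ``d`` as the first argument, exactly as ``d.f(3)`` would.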
|
||||
|
||||
|
||||
Static Methods and Class Methods
|
||||
--------------------------------
|
||||
|
||||
Non-data descriptors provide a simple mechanism for variations on the usual
|
||||
patterns of binding functions into methods.
|
||||
|
||||
To recap, functions have a :meth:`__get__` method so that they can be converted
|
||||
to a method when accessed as attributes. The non-data descriptor transforms an
|
||||
``obj.f(*args)`` call into ``f(obj, *args)``. Calling ``klass.f(*args)``
|
||||
becomes ``f(*args)``.
|
||||
|
||||
This chart summarizes the binding and its two most useful variants:
|
||||
|
||||
+-----------------+----------------------+------------------+
| Transformation  | Called from an       | Called from a    |
|                 | Object               | Class            |
+=================+======================+==================+
| function        | f(obj, \*args)       | f(\*args)        |
+-----------------+----------------------+------------------+
| staticmethod    | f(\*args)            | f(\*args)        |
+-----------------+----------------------+------------------+
| classmethod     | f(type(obj), \*args) | f(klass, \*args) |
+-----------------+----------------------+------------------+
|
||||
|
||||
Static methods return the underlying function without changes. Calling either
|
||||
``c.f`` or ``C.f`` is the equivalent of a direct lookup into
|
||||
``object.__getattribute__(c, "f")`` or ``object.__getattribute__(C, "f")``. As a
|
||||
result, the function becomes identically accessible from either an object or a
|
||||
class.
|
||||
|
||||
Good candidates for static methods are methods that do not reference the
|
||||
``self`` variable.
|
||||
|
||||
For instance, a statistics package may include a container class for
|
||||
experimental data. The class provides normal methods for computing the average,
|
||||
mean, median, and other descriptive statistics that depend on the data. However,
|
||||
there may be useful functions which are conceptually related but do not depend
|
||||
on the data. For instance, ``erf(x)`` is a handy conversion routine that comes up
|
||||
in statistical work but does not directly depend on a particular dataset.
|
||||
It can be called either from an object or the class: ``s.erf(1.5) --> .9661`` or
``Sample.erf(1.5) --> .9661``.
|
||||
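
A sketch of such a container class follows. The ``Sample`` class body is hypothetical; it assumes that :func:`math.erf` (available since Python 2.7) is an acceptable implementation of the error function:

```python
import math


class Sample(object):
    """Hypothetical container for experimental data."""

    def __init__(self, data):
        self.data = list(data)

    def mean(self):
        # Depends on the instance data, so it is a normal method.
        return sum(self.data) / float(len(self.data))

    @staticmethod
    def erf(x):
        # Independent of any dataset, so it needs neither self nor the class.
        return math.erf(x)


s = Sample([1, 2, 3, 4])
from_instance = s.erf(1.5)
from_class = Sample.erf(1.5)
```

The same function object is reached through the instance and through the class, so both calls return the same number.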
|
||||
Since staticmethods return the underlying function with no changes, the example
|
||||
calls are unexciting::
|
||||
|
||||
    >>> class E(object):
    ...     def f(x):
    ...         print x
    ...     f = staticmethod(f)
    ...
    >>> E.f(3)
    3
    >>> E().f(3)
    3
|
||||
|
||||
Using the non-data descriptor protocol, a pure Python version of
|
||||
:func:`staticmethod` would look like this::
|
||||
|
||||
    class StaticMethod(object):
        "Emulate PyStaticMethod_Type() in Objects/funcobject.c"

        def __init__(self, f):
            self.f = f

        def __get__(self, obj, objtype=None):
            return self.f
|
||||
|
||||
Unlike static methods, class methods prepend the class reference to the
|
||||
argument list before calling the function. This format is the same
|
||||
whether the caller is an object or a class::
|
||||
|
||||
    >>> class E(object):
    ...     def f(klass, x):
    ...         return klass.__name__, x
    ...     f = classmethod(f)
    ...
    >>> print E.f(3)
    ('E', 3)
    >>> print E().f(3)
    ('E', 3)
|
||||
|
||||
|
||||
This behavior is useful whenever the function only needs to have a class
|
||||
reference and does not care about any underlying data. One use for classmethods
|
||||
is to create alternate class constructors. In Python 2.3, the classmethod
|
||||
:func:`dict.fromkeys` creates a new dictionary from a list of keys. The pure
|
||||
Python equivalent is::
|
||||
|
||||
    class Dict(object):
        . . .
        def fromkeys(klass, iterable, value=None):
            "Emulate dict_fromkeys() in Objects/dictobject.c"
            d = klass()
            for key in iterable:
                d[key] = value
            return d
        fromkeys = classmethod(fromkeys)
|
||||
|
||||
Now a new dictionary of unique keys can be constructed like this::
|
||||
|
||||
>>> Dict.fromkeys('abracadabra')
|
||||
{'a': None, 'r': None, 'b': None, 'c': None, 'd': None}
|
||||
|
||||
Using the non-data descriptor protocol, a pure Python version of
|
||||
:func:`classmethod` would look like this::
|
||||
|
||||
    class ClassMethod(object):
        "Emulate PyClassMethod_Type() in Objects/funcobject.c"

        def __init__(self, f):
            self.f = f

        def __get__(self, obj, klass=None):
            if klass is None:
                klass = type(obj)
            def newfunc(*args):
                return self.f(klass, *args)
            return newfunc
|
||||
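
The emulation can be exercised like the builtin. The sketch below repeats the ``ClassMethod`` class so that it is self-contained; the classes ``E`` and ``F`` are hypothetical:

```python
class ClassMethod(object):
    "Pure Python emulation of classmethod, as described in the text"

    def __init__(self, f):
        self.f = f

    def __get__(self, obj, klass=None):
        if klass is None:
            klass = type(obj)
        def newfunc(*args):
            return self.f(klass, *args)
        return newfunc


class E(object):
    def f(klass, x):
        return klass.__name__, x
    f = ClassMethod(f)


class F(E):
    pass


from_class = E.f(3)
from_instance = E().f(3)
from_subclass = F.f(3)   # the class actually used for the lookup is passed
```

Note that a subclass receives itself as ``klass``, which is what makes class methods suitable as alternate constructors.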
|
||||
327
Doc/howto/doanddont.rst
Normal file
@@ -0,0 +1,327 @@
|
||||
************************************
|
||||
Idioms and Anti-Idioms in Python
|
||||
************************************
|
||||
|
||||
:Author: Moshe Zadka
|
||||
|
||||
This document is placed in the public domain.
|
||||
|
||||
|
||||
.. topic:: Abstract
|
||||
|
||||
This document can be considered a companion to the tutorial. It shows how to use
|
||||
Python, and even more importantly, how *not* to use Python.
|
||||
|
||||
|
||||
Language Constructs You Should Not Use
|
||||
======================================
|
||||
|
||||
While Python has relatively few gotchas compared to other languages, it still
|
||||
has some constructs which are only useful in corner cases, or are plain
|
||||
dangerous.
|
||||
|
||||
|
||||
from module import \*
|
||||
---------------------
|
||||
|
||||
|
||||
Inside Function Definitions
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
``from module import *`` is *invalid* inside function definitions. While many
|
||||
versions of Python do not check for the invalidity, it does not make it more
|
||||
valid, no more than having a smart lawyer makes a man innocent. Do not use it
|
||||
like that ever. Even in versions where it was accepted, it made the function
|
||||
execution slower, because the compiler could not be certain which names were
|
||||
local and which were global. In Python 2.1 this construct causes warnings, and
|
||||
sometimes even errors.
|
||||
|
||||
|
||||
At Module Level
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
While it is valid to use ``from module import *`` at module level it is usually
|
||||
a bad idea. For one, this loses an important property Python otherwise has ---
|
||||
you can know where each toplevel name is defined by a simple "search" function
|
||||
in your favourite editor. You also open yourself to trouble in the future, if
|
||||
some module grows additional functions or classes.
|
||||
|
||||
One of the most awful questions asked on the newsgroup is why this code::
|
||||
|
||||
f = open("www")
|
||||
f.read()
|
||||
|
||||
does not work. Of course, it works just fine (assuming you have a file called
|
||||
"www".) But it does not work if somewhere in the module, the statement ``from
|
||||
os import *`` is present. The :mod:`os` module has a function called
|
||||
:func:`open` which returns an integer. While it is very useful, shadowing a
|
||||
builtin is one of its least useful properties.
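
The difference between the two functions is easy to see when both are called explicitly. This is a sketch only: the temporary file stands in for ``"www"``, and the code is written to run on both Python 2.7 and Python 3:

```python
import os
import tempfile

# Create a small scratch file to open.
handle, path = tempfile.mkstemp()
os.close(handle)
with open(path, "w") as f:
    f.write("hello")

builtin_file = open(path)          # builtin: returns a file object
contents = builtin_file.read()
builtin_file.close()

fd = os.open(path, os.O_RDONLY)    # os.open: returns an integer descriptor
fd_is_int = isinstance(fd, int)
fd_has_read = hasattr(fd, "read")
os.close(fd)
os.remove(path)
```

After ``from os import *`` the name ``open`` refers to the second function, so ``f.read()`` fails because an integer has no ``read`` method.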
|
||||
|
||||
Remember, you can never know for sure what names a module exports, so either
|
||||
take what you need --- ``from module import name1, name2``, or keep them in the
|
||||
module and access on a per-need basis --- ``import module;print module.name``.
|
||||
|
||||
|
||||
When It Is Just Fine
|
||||
^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
There are situations in which ``from module import *`` is just fine:
|
||||
|
||||
* The interactive prompt. For example, ``from math import *`` makes Python an
|
||||
amazing scientific calculator.
|
||||
|
||||
* When extending a module in C with a module in Python.
|
||||
|
||||
* When the module advertises itself as ``from import *`` safe.
|
||||
|
||||
|
||||
Unadorned :keyword:`exec`, :func:`execfile` and friends
|
||||
-------------------------------------------------------
|
||||
|
||||
The word "unadorned" refers to the use without an explicit dictionary, in which
|
||||
case those constructs evaluate code in the *current* environment. This is
|
||||
dangerous for the same reasons ``from import *`` is dangerous --- it might step
|
||||
over variables you are counting on and mess up things for the rest of your code.
|
||||
Simply do not do that.
|
||||
|
||||
Bad examples::
|
||||
|
||||
    >>> for name in sys.argv[1:]:
    >>>     exec "%s=1" % name
    >>> def func(s, **kw):
    >>>     for var, val in kw.items():
    >>>         exec "s.%s=val" % var  # invalid!
    >>> execfile("handler.py")
    >>> handle()
|
||||
|
||||
Good examples::
|
||||
|
||||
    >>> d = {}
    >>> for name in sys.argv[1:]:
    >>>     d[name] = 1
    >>> def func(s, **kw):
    >>>     for var, val in kw.items():
    >>>         setattr(s, var, val)
    >>> d={}
    >>> execfile("handler.py", d, d)
    >>> handle = d['handle']
    >>> handle()
|
||||
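
The same pattern works for code held in a string. In this sketch the ``source`` string is a hypothetical stand-in for the contents of a handler file, and the parenthesized form of ``exec`` is used so that the example runs on Python 3 as well as Python 2.7:

```python
source = "def handle():\n    return 'handled'\n"

namespace = {}
exec(source, namespace)         # names land in the dictionary, not here
handle = namespace['handle']    # take out only what is needed
result = handle()
```

Anything else the executed code defines stays confined to ``namespace`` and cannot overwrite variables in the calling module.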
|
||||
|
||||
from module import name1, name2
|
||||
-------------------------------
|
||||
|
||||
This is a "don't" which is much weaker than the previous "don't"s but is still
|
||||
something you should not do if you don't have good reasons to do that. The
|
||||
reason it is usually a bad idea is because you suddenly have an object which lives
|
||||
in two separate namespaces. When the binding in one namespace changes, the
|
||||
binding in the other will not, so there will be a discrepancy between them. This
|
||||
happens when, for example, one module is reloaded, or changes the definition of
|
||||
a function at runtime.
|
||||
|
||||
Bad example::
|
||||
|
||||
    # foo.py
    a = 1

    # bar.py
    from foo import a
    if something():
        a = 2 # danger: foo.a != a
|
||||
|
||||
Good example::
|
||||
|
||||
    # foo.py
    a = 1

    # bar.py
    import foo
    if something():
        foo.a = 2
|
||||
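
The discrepancy between the two namespaces can be reproduced without writing any files. In this sketch a module object created with :class:`types.ModuleType` stands in for ``foo.py``, and a plain assignment stands in for the ``from`` import:

```python
import types

foo = types.ModuleType("foo")   # hypothetical stand-in for foo.py
foo.a = 1

a = foo.a      # effectively what "from foo import a" does: a second binding
foo.a = 2      # the module later rebinds its own name

local_value = a
module_value = foo.a
```

The local name still refers to the old object, while attribute access through the module always sees the current binding.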
|
||||
|
||||
except:
|
||||
-------
|
||||
|
||||
Python has the ``except:`` clause, which catches all exceptions. Since *every*
|
||||
error in Python raises an exception, using ``except:`` can make many
|
||||
programming errors look like runtime problems, which hinders the debugging
|
||||
process.
|
||||
|
||||
The following code shows a great example of why this is bad::
|
||||
|
||||
    try:
        foo = opne("file") # misspelled "open"
    except:
        sys.exit("could not open file!")
|
||||
|
||||
The second line triggers a :exc:`NameError`, which is caught by the except
|
||||
clause. The program will exit, and the error message the program prints will
|
||||
make you think the problem is the readability of ``"file"`` when in fact
|
||||
the real error has nothing to do with ``"file"``.
|
||||
|
||||
A better way to write the above is ::
|
||||
|
||||
    try:
        foo = opne("file")
    except IOError:
        sys.exit("could not open file")
|
||||
|
||||
When this is run, Python will produce a traceback showing the :exc:`NameError`,
|
||||
and it will be immediately apparent what needs to be fixed.
|
||||
|
||||
.. index:: bare except, except; bare
|
||||
|
||||
Because ``except:`` catches *all* exceptions, including :exc:`SystemExit`,
|
||||
:exc:`KeyboardInterrupt`, and :exc:`GeneratorExit` (which is not an error and
|
||||
should not normally be caught by user code), using a bare ``except:`` is almost
|
||||
never a good idea. In situations where you need to catch all "normal" errors,
|
||||
such as in a framework that runs callbacks, you can catch the base class for
|
||||
all normal exceptions, :exc:`Exception`. Unfortunately in Python 2.x it is
|
||||
possible for third-party code to raise exceptions that do not inherit from
|
||||
:exc:`Exception`, so in Python 2.x there are some cases where you may have to
|
||||
use a bare ``except:`` and manually re-raise the exceptions you don't want
|
||||
to catch.
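
For the callback case, catching :exc:`Exception` might look like the sketch below. The function ``run_callbacks`` and the sample callbacks are hypothetical:

```python
def run_callbacks(callbacks):
    """Run each callback, collecting ordinary errors instead of stopping."""
    failures = []
    for callback in callbacks:
        try:
            callback()
        except Exception as err:
            # KeyboardInterrupt and SystemExit do not inherit from
            # Exception, so they still propagate to the caller.
            failures.append((callback.__name__, err))
    return failures


def ok():
    pass


def broken():
    raise ValueError("bad input")


def interrupt():
    raise KeyboardInterrupt


failures = run_callbacks([ok, broken])

try:
    run_callbacks([interrupt])
    interrupt_propagated = False
except KeyboardInterrupt:
    interrupt_propagated = True
```

The loop survives a misbehaving callback but does not swallow a user's request to stop the program.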
|
||||
|
||||
|
||||
Exceptions
|
||||
==========
|
||||
|
||||
Exceptions are a useful feature of Python. You should learn to raise them
|
||||
whenever something unexpected occurs, and catch them only where you can do
|
||||
something about them.
|
||||
|
||||
The following is a very popular anti-idiom ::
|
||||
|
||||
    def get_status(file):
        if not os.path.exists(file):
            print "file not found"
            sys.exit(1)
        return open(file).readline()
|
||||
|
||||
Consider the case where the file gets deleted between the time the call to
|
||||
:func:`os.path.exists` is made and the time :func:`open` is called. In that
|
||||
case the last line will raise an :exc:`IOError`. The same thing would happen
|
||||
if *file* exists but has no read permission. Since testing this on a normal
|
||||
machine on existent and non-existent files makes it seem bugless, the test
|
||||
results will seem fine, and the code will get shipped. Later an unhandled
|
||||
:exc:`IOError` (or perhaps some other :exc:`EnvironmentError`) escapes to the
|
||||
user, who gets to watch the ugly traceback.
|
||||
|
||||
Here is a somewhat better way to do it. ::
|
||||
|
||||
    def get_status(file):
        try:
            return open(file).readline()
        except EnvironmentError as err:
            print "Unable to open file: {}".format(err)
            sys.exit(1)
|
||||
|
||||
In this version, *either* the file gets opened and the line is read (so it
|
||||
works even on flaky NFS or SMB connections), or an error message is printed
|
||||
that provides all the available information on why the open failed, and the
|
||||
application is aborted.
|
||||
|
||||
However, even this version of :func:`get_status` makes too many assumptions ---
|
||||
that it will only be used in a short running script, and not, say, in a long
|
||||
running server. Sure, the caller could do something like ::
|
||||
|
||||
    try:
        status = get_status(log)
    except SystemExit:
        status = None
|
||||
|
||||
But there is a better way. You should try to use as few ``except`` clauses in
|
||||
your code as you can --- the ones you do use will usually be inside calls which
|
||||
should always succeed, or a catch-all in a main function.
|
||||
|
||||
So, an even better version of :func:`get_status()` is probably ::
|
||||
|
||||
    def get_status(file):
        return open(file).readline()
|
||||
|
||||
The caller can deal with the exception if it wants (for example, if it tries
|
||||
several files in a loop), or just let the exception filter upwards to *its*
|
||||
caller.
|
||||
|
||||
But the last version still has a serious problem --- due to implementation
|
||||
details in CPython, the file would not be closed when an exception is raised
|
||||
until the exception handler finishes; and, worse, in other implementations
|
||||
(e.g., Jython) it might not be closed at all regardless of whether or not
|
||||
an exception is raised.
|
||||
|
||||
The best version of this function uses the ``open()`` call as a context
|
||||
manager, which will ensure that the file gets closed as soon as the
|
||||
function returns::
|
||||
|
||||
    def get_status(file):
        with open(file) as fp:
            return fp.readline()
|
||||
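
With the error handling left to the caller, trying several files in a loop becomes straightforward. A sketch under stated assumptions: ``first_status`` is a hypothetical helper, and the temporary file exists only to make the example self-contained:

```python
import os
import tempfile


def get_status(file):
    with open(file) as fp:
        return fp.readline()


def first_status(paths):
    """Return the status line of the first readable file, or None."""
    for path in paths:
        try:
            return get_status(path)
        except (IOError, OSError):
            continue    # this caller knows what to do: try the next file
    return None


handle, good = tempfile.mkstemp()
os.close(handle)
with open(good, "w") as fp:
    fp.write("ready\n")
missing = good + ".missing"

status = first_status([missing, good])
os.remove(good)
```

The decision about what a failed open means is made where the context is known, not inside ``get_status``.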
|
||||
|
||||
Using the Batteries
|
||||
===================
|
||||
|
||||
Every so often, people seem to be writing stuff in the Python library again,
|
||||
usually poorly. While the occasional module has a poor interface, it is usually
|
||||
much better to use the rich standard library and data types that come with
|
||||
Python than inventing your own.
|
||||
|
||||
A useful module very few people know about is :mod:`os.path`. It always has the
|
||||
correct path arithmetic for your operating system, and will usually be much
|
||||
better than whatever you come up with yourself.
|
||||
|
||||
Compare::
|
||||
|
||||
# ugh!
|
||||
return dir+"/"+file
|
||||
# better
|
||||
return os.path.join(dir, file)
|
||||
|
||||
More useful functions in :mod:`os.path`: :func:`basename`, :func:`dirname` and
|
||||
:func:`splitext`.
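
A short illustration of these functions; the path components are made up for the example:

```python
import os.path

path = os.path.join("reports", "2020", "summary.tar.gz")

name = os.path.basename(path)           # final component of the path
parent = os.path.dirname(path)          # everything before the final component
stem, ext = os.path.splitext(name)      # splits at the last dot only
```

Note that :func:`splitext` treats only the last suffix as the extension, so ``stem`` here still ends in ``.tar``.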
|
||||
|
||||
There are also many useful built-in functions people seem not to be aware of
|
||||
for some reason: :func:`min` and :func:`max` can find the minimum/maximum of
|
||||
any sequence with comparable semantics, for example, yet many people write
|
||||
their own :func:`max`/:func:`min`. Another highly useful function is
|
||||
:func:`reduce` which can be used to repeatedly apply a binary operation to a
|
||||
sequence, reducing it to a single value. For example, compute a factorial
|
||||
with a series of multiply operations::
|
||||
|
||||
>>> n = 4
|
||||
>>> import operator
|
||||
>>> reduce(operator.mul, range(1, n+1))
|
||||
24
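
The same builtins cover many hand-written loops. In this sketch :func:`reduce` is imported from :mod:`functools`, where it is available in Python 2.6 and later and is the only location in Python 3; the ``factorial`` helper and the word list are hypothetical:

```python
from functools import reduce
import operator


def factorial(n):
    # The initial value 1 makes the empty range (n == 0) work too.
    return reduce(operator.mul, range(1, n + 1), 1)


words = ["descriptor", "idiom", "batteries"]
longest = max(words, key=len)     # key functions avoid hand-written loops
shortest = min(words, key=len)
```

Passing an initial value to :func:`reduce` avoids a :exc:`TypeError` on an empty sequence.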
|
||||
|
||||
When it comes to parsing numbers, note that :func:`float`, :func:`int` and
|
||||
:func:`long` all accept string arguments and will reject ill-formed strings
|
||||
by raising a :exc:`ValueError`.
|
||||
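
This means a hand-written validity check is rarely needed before converting. A minimal sketch, with ``parse_int`` as a hypothetical helper name:

```python
def parse_int(text, default=None):
    """Convert text to an int, returning default for ill-formed input."""
    try:
        return int(text)
    except ValueError:
        return default


good = parse_int("42")
padded = parse_int("  7 ")     # surrounding whitespace is accepted
bad = parse_int("4x2")         # ill-formed, so the default is returned
```

The conversion function itself decides what counts as well-formed, so the check and the parse cannot disagree.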
|
||||
|
||||
Using Backslash to Continue Statements
|
||||
======================================
|
||||
|
||||
Since Python treats a newline as a statement terminator, and since statements
|
||||
are often more than is comfortable to put in one line, many people do::
|
||||
|
||||
    if foo.bar()['first'][0] == baz.quux(1, 2)[5:9] and \
            calculate_number(10, 20) != forbulate(500, 360):
        pass
|
||||
|
||||
You should realize that this is dangerous: a stray space after the ``\`` would
|
||||
make this line wrong, and stray spaces are notoriously hard to see in editors.
|
||||
In this case, at least it would be a syntax error, but if the code was::
|
||||
|
||||
    value = foo.bar()['first'][0]*baz.quux(1, 2)[5:9] \
            + calculate_number(10, 20)*forbulate(500, 360)
|
||||
|
||||
then it would just be subtly wrong.
|
||||
|
||||
It is usually much better to use the implicit continuation inside parentheses.
|
||||
|
||||
This version is bulletproof::
|
||||
|
||||
    value = (foo.bar()['first'][0]*baz.quux(1, 2)[5:9]
             + calculate_number(10, 20)*forbulate(500, 360))
|
||||
|
||||
1250
Doc/howto/functional.rst
Normal file
File diff suppressed because it is too large
31
Doc/howto/index.rst
Normal file
@@ -0,0 +1,31 @@
|
||||
***************
|
||||
Python HOWTOs
|
||||
***************
|
||||
|
||||
Python HOWTOs are documents that cover a single, specific topic,
|
||||
and attempt to cover it fairly completely. Modelled on the Linux
|
||||
Documentation Project's HOWTO collection, this collection is an
|
||||
effort to foster documentation that's more detailed than the
|
||||
Python Library Reference.
|
||||
|
||||
Currently, the HOWTOs are:
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
pyporting.rst
|
||||
cporting.rst
|
||||
curses.rst
|
||||
descriptor.rst
|
||||
doanddont.rst
|
||||
functional.rst
|
||||
logging.rst
|
||||
logging-cookbook.rst
|
||||
regex.rst
|
||||
sockets.rst
|
||||
sorting.rst
|
||||
unicode.rst
|
||||
urllib2.rst
|
||||
webservers.rst
|
||||
argparse.rst
|
||||
|
||||
1538
Doc/howto/logging-cookbook.rst
Normal file
File diff suppressed because it is too large
1051
Doc/howto/logging.rst
Normal file
File diff suppressed because it is too large
BIN
Doc/howto/logging_flow.png
Executable file
Binary file not shown.
|
After Width: | Height: | Size: 48 KiB |
452
Doc/howto/pyporting.rst
Normal file
@@ -0,0 +1,452 @@
|
||||
.. _pyporting-howto:
|
||||
|
||||
*********************************
|
||||
Porting Python 2 Code to Python 3
|
||||
*********************************
|
||||
|
||||
:author: Brett Cannon
|
||||
|
||||
.. topic:: Abstract
|
||||
|
||||
With Python 3 being the future of Python while Python 2 is still in active
|
||||
use, it is good to have your project available for both major releases of
|
||||
Python. This guide is meant to help you figure out how best to support both
|
||||
Python 2 & 3 simultaneously.
|
||||
|
||||
If you are looking to port an extension module instead of pure Python code,
|
||||
please see :ref:`cporting-howto`.
|
||||
|
||||
If you would like to read one core Python developer's take on why Python 3
|
||||
came into existence, you can read Nick Coghlan's `Python 3 Q & A`_ or
|
||||
Brett Cannon's `Why Python 3 exists`_.
|
||||
|
||||
For help with porting, you can email the python-porting_ mailing list with
|
||||
questions.
|
||||
|
||||
The Short Explanation
|
||||
=====================
|
||||
|
||||
To make your project be single-source Python 2/3 compatible, the basic steps
|
||||
are:
|
||||
|
||||
#. Only worry about supporting Python 2.7
#. Make sure you have good test coverage (coverage.py_ can help;
   ``pip install coverage``)
#. Learn the differences between Python 2 & 3
#. Use Futurize_ (or Modernize_) to update your code (e.g. ``pip install future``)
#. Use Pylint_ to help make sure you don't regress on your Python 3 support
   (``pip install pylint``)
#. Use caniusepython3_ to find out which of your dependencies are blocking your
   use of Python 3 (``pip install caniusepython3``)
#. Once your dependencies are no longer blocking you, use continuous integration
   to make sure you stay compatible with Python 2 & 3 (tox_ can help test
   against multiple versions of Python; ``pip install tox``)
#. Consider using optional static type checking to make sure your type usage
   works in both Python 2 & 3 (e.g. use mypy_ to check your typing under both
   Python 2 & Python 3).
|
||||
|
||||
|
||||
Details
|
||||
=======
|
||||
|
||||
A key point about supporting Python 2 & 3 simultaneously is that you can start
|
||||
**today**! Even if your dependencies are not supporting Python 3 yet that does
|
||||
not mean you can't modernize your code **now** to support Python 3. Most changes
|
||||
required to support Python 3 lead to cleaner code using newer practices even in
|
||||
Python 2 code.
|
||||
|
||||
Another key point is that modernizing your Python 2 code to also support
|
||||
Python 3 is largely automated for you. While you might have to make some API
|
||||
decisions thanks to Python 3 clarifying text data versus binary data, the
|
||||
lower-level work is now mostly done for you and thus can at least benefit from
|
||||
the automated changes immediately.
|
||||
|
||||
Keep those key points in mind while you read on about the details of porting
|
||||
your code to support Python 2 & 3 simultaneously.
|
||||
|
||||
|
||||
Drop support for Python 2.6 and older
|
||||
-------------------------------------
|
||||
|
||||
While you can make Python 2.5 work with Python 3, it is **much** easier if you
|
||||
only have to work with Python 2.7. If dropping Python 2.5 is not an
|
||||
option then the six_ project can help you support Python 2.5 & 3 simultaneously
|
||||
(``pip install six``). Do realize, though, that nearly all the projects listed
|
||||
in this HOWTO will not be available to you.
|
||||
|
||||
If you are able to skip Python 2.5 and older, then the required changes
|
||||
to your code should continue to look and feel like idiomatic Python code. At
|
||||
worst you will have to use a function instead of a method in some instances or
|
||||
have to import a function instead of using a built-in one, but otherwise the
|
||||
overall transformation should not feel foreign to you.
|
||||
|
||||
But you should aim for only supporting Python 2.7. Python 2.6 is no longer
|
||||
freely supported and thus is not receiving bugfixes. This means **you** will have
|
||||
to work around any issues you come across with Python 2.6. There are also some
|
||||
tools mentioned in this HOWTO which do not support Python 2.6 (e.g., Pylint_),
|
||||
and this will become more commonplace as time goes on. It will simply be easier
|
||||
for you if you only support the versions of Python that you have to support.
|
||||
|
||||
|
||||
Make sure you specify the proper version support in your ``setup.py`` file
|
||||
--------------------------------------------------------------------------
|
||||
|
||||
In your ``setup.py`` file you should have the proper `trove classifier`_
|
||||
specifying what versions of Python you support. As your project does not support
|
||||
Python 3 yet you should at least have
|
||||
``Programming Language :: Python :: 2 :: Only`` specified. Ideally you should
|
||||
also specify each major/minor version of Python that you do support, e.g.
|
||||
``Programming Language :: Python :: 2.7``.
|
||||
|
||||
|
||||
Have good test coverage
|
||||
-----------------------
|
||||
|
||||
Once you have your code supporting the oldest version of Python 2 you want it
|
||||
to, you will want to make sure your test suite has good coverage. A good rule of
|
||||
thumb is that you want to be confident enough in your test suite that any
|
||||
failures that appear after having tools rewrite your code are actual bugs in the
|
||||
tools and not in your code. If you want a number to aim for, try to get over 80%
|
||||
coverage (and don't feel bad if you find it hard to get better than 90%
|
||||
coverage). If you don't already have a tool to measure test coverage then
|
||||
coverage.py_ is recommended.
|
||||
|
||||
|
||||
Learn the differences between Python 2 & 3
|
||||
-------------------------------------------
|
||||
|
||||
Once you have your code well-tested you are ready to begin porting your code to
|
||||
Python 3! But to fully understand how your code is going to change and what
|
||||
you want to look out for while you code, you will want to learn what changes
|
||||
Python 3 makes in terms of Python 2. Typically the two best ways of doing that
|
||||
are reading the `"What's New"`_ doc for each release of Python 3 and the
|
||||
`Porting to Python 3`_ book (which is free online). There is also a handy
|
||||
`cheat sheet`_ from the Python-Future project.
|
||||
|
||||
|
||||
Update your code
|
||||
----------------
|
||||
|
||||
Once you feel like you know what is different in Python 3 compared to Python 2,
|
||||
it's time to update your code! You have a choice between two tools in porting
|
||||
your code automatically: Futurize_ and Modernize_. Which tool you choose will
|
||||
depend on how much like Python 3 you want your code to be. Futurize_ does its
|
||||
best to make Python 3 idioms and practices exist in Python 2, e.g. backporting
|
||||
the ``bytes`` type from Python 3 so that you have semantic parity between the
|
||||
major versions of Python. Modernize_,
|
||||
on the other hand, is more conservative and targets a Python 2/3 subset of
|
||||
Python, directly relying on six_ to help provide compatibility. As Python 3 is
|
||||
the future, it might be best to consider Futurize to begin adjusting to any new
|
||||
practices that Python 3 introduces which you are not accustomed to yet.
|
||||
|
||||
Regardless of which tool you choose, they will update your code to run under
|
||||
Python 3 while staying compatible with the version of Python 2 you started with.
|
||||
Depending on how conservative you want to be, you may want to run the tool over
|
||||
your test suite first and visually inspect the diff to make sure the
|
||||
transformation is accurate. After you have transformed your test suite and
|
||||
verified that all the tests still pass as expected, then you can transform your
|
||||
application code knowing that any test which fails is a translation failure.
|
||||
|
||||
Unfortunately the tools can't automate everything to make your code work under
|
||||
Python 3 and so there are a handful of things you will need to update manually
|
||||
to get full Python 3 support (which of these steps are necessary varies between
|
||||
the tools). Read the documentation for the tool you choose to use to see what it
|
||||
fixes by default and what it can do optionally to know what will (not) be fixed
|
||||
for you and what you may have to fix on your own (e.g. using ``io.open()`` over
|
||||
the built-in ``open()`` function is off by default in Modernize). Luckily,
|
||||
though, there are only a couple of things to watch out for which can be
|
||||
considered large issues that may be hard to debug if not watched for.
|
||||
|
||||
|
||||
Division
|
||||
++++++++
|
||||
|
||||
In Python 3, ``5 / 2 == 2.5`` and not ``2``; all division between ``int`` values
|
||||
results in a ``float``. This change has actually been planned since Python 2.2
|
||||
which was released in 2002. Since then users have been encouraged to add
|
||||
``from __future__ import division`` to any and all files which use the ``/`` and
|
||||
``//`` operators or to be running the interpreter with the ``-Q`` flag. If you
|
||||
have not been doing this then you will need to go through your code and do two
|
||||
things:
|
||||
|
||||
#. Add ``from __future__ import division`` to your files
#. Update any division operator as necessary to either use ``//`` to use floor
   division or continue using ``/`` and expect a float
|
||||
|
||||
The reason that ``/`` isn't simply translated to ``//`` automatically is that if
|
||||
an object defines a ``__truediv__`` method but not ``__floordiv__`` then your
|
||||
code would begin to fail (e.g. a user-defined class that uses ``/`` to
|
||||
signify some operation but not ``//`` for the same thing or at all).
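
As a rough illustration of that last point, here is a minimal sketch of a class that gives ``/`` a meaning but has no ``//`` counterpart. The ``Route`` class and its ``parts`` attribute are hypothetical names invented for this example; the point is only that blindly rewriting ``/`` to ``//`` would turn working code into a ``TypeError``.

```python
class Route(object):
    """Hypothetical class that uses ``/`` to join path segments."""

    def __init__(self, *parts):
        self.parts = parts

    def __truediv__(self, other):
        # Called for ``/`` on Python 3, and on Python 2 when
        # ``from __future__ import division`` is in effect.
        return Route(*(self.parts + (other,)))

    # Classic ``/`` on Python 2 when the __future__ import is absent.
    __div__ = __truediv__


joined = Route("api") / "v1"

try:
    Route("api") // "v1"        # no __floordiv__ is defined
    floor_supported = True
except TypeError:
    floor_supported = False
```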
|
||||
|
||||
|
||||
Text versus binary data
|
||||
+++++++++++++++++++++++
|
||||
|
||||
In Python 2 you could use the ``str`` type for both text and binary data.
|
||||
Unfortunately this conflation of two different concepts could lead to brittle
|
||||
code which sometimes worked for either kind of data, sometimes not. It also
|
||||
could lead to confusing APIs if people didn't explicitly state that something
|
||||
that accepted ``str`` accepted either text or binary data instead of one
|
||||
specific type. This complicated the situation especially for anyone supporting
|
||||
multiple languages as APIs wouldn't bother explicitly supporting ``unicode``
|
||||
when they claimed text data support.
|
||||
|
||||
To make the distinction between text and binary data clearer and more
|
||||
pronounced, Python 3 did what most languages created in the age of the internet
|
||||
have done and made text and binary data distinct types that cannot blindly be
|
||||
mixed together (Python predates widespread access to the internet). For any code
|
||||
that deals only with text or only binary data, this separation doesn't pose an
|
||||
issue. But for code that has to deal with both, it does mean you might have to
|
||||
now care about when you are using text compared to binary data, which is why
|
||||
this cannot be entirely automated.
|
||||
|
||||
To start, you will need to decide which APIs take text and which take binary
|
||||
(it is **highly** recommended you don't design APIs that can take both due to
|
||||
the difficulty of keeping the code working; as stated earlier it is difficult to
|
||||
do well). In Python 2 this means making sure the APIs that take text can work
|
||||
with ``unicode`` and those that work with binary data work with the
|
||||
``bytes`` type from Python 3 (which is a subset of ``str`` in Python 2 and acts
|
||||
as an alias for the ``str`` type in Python 2). Usually the biggest issue is
|
||||
realizing which methods exist on which types in Python 2 & 3 simultaneously
|
||||
(for text that's ``unicode`` in Python 2 and ``str`` in Python 3, for binary
|
||||
that's ``str``/``bytes`` in Python 2 and ``bytes`` in Python 3). The following
|
||||
table lists the **unique** methods of each data type across Python 2 & 3
|
||||
(e.g., the ``decode()`` method is usable on the equivalent binary data type in
|
||||
either Python 2 or 3, but it can't be used by the textual data type consistently
|
||||
between Python 2 and 3 because ``str`` in Python 3 doesn't have the method). Do
|
||||
note that as of Python 3.5 the ``__mod__`` method was added to the bytes type.
|
||||
|
||||
======================== =====================
**Text data**            **Binary data**
------------------------ ---------------------
\                        decode
------------------------ ---------------------
encode
------------------------ ---------------------
format
------------------------ ---------------------
isdecimal
------------------------ ---------------------
isnumeric
======================== =====================
|
||||
|
||||
Making the distinction easier to handle can be accomplished by encoding and
|
||||
decoding between binary data and text at the edge of your code. This means that
|
||||
when you receive text in binary data, you should immediately decode it. And if
|
||||
your code needs to send text as binary data then encode it as late as possible.
|
||||
This allows your code to work with only text internally and thus eliminates
|
||||
having to keep track of what type of data you are working with.
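
A minimal sketch of this "decode early, encode late" shape might look like the following. The ``shout`` function and the choice of UTF-8 are assumptions made for illustration; your own code would use whatever encoding its protocol or file format actually specifies.

```python
def shout(raw_request, encoding="utf-8"):
    """Hypothetical handler: bytes come in, bytes go out, text in between."""
    text = raw_request.decode(encoding)   # decode as soon as the data arrives
    reply = text.upper() + u"!"           # work only with text internally
    return reply.encode(encoding)         # encode as late as possible


response = shout(b"caf\xc3\xa9")
```

Everything between the first and last line of the function only ever sees text, so there is a single place to look when an encoding problem shows up.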
|
||||
|
||||
The next issue is making sure you know whether the string literals in your code
|
||||
represent text or binary data. You should add a ``b`` prefix to any
|
||||
literal that represents binary data. For text you should add a ``u`` prefix to
|
||||
the text literal. (There is a :mod:`__future__` import to force all unspecified
|
||||
literals to be Unicode, but usage has shown it isn't as effective as adding a
|
||||
``b`` or ``u`` prefix to all literals explicitly.)
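
For instance, explicitly prefixed literals could look like this (the variable names are made up for the example; the ``u`` prefix is accepted by Python 2 and by Python 3.3 and newer):

```python
png_magic = b"\x89PNG"   # binary data: ``bytes`` on Python 3, ``str`` on Python 2
greeting = u"hello"      # text: ``str`` on Python 3, ``unicode`` on Python 2

text_type = type(u"")    # whichever type represents text on this interpreter
```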
|
||||
|
||||
As part of this dichotomy you also need to be careful about opening files.
|
||||
Unless you have been working on Windows, there is a chance you have not always
|
||||
bothered to add the ``b`` mode when opening a binary file (e.g., ``rb`` for
|
||||
binary reading). Under Python 3, binary files and text files are clearly
|
||||
distinct and mutually incompatible; see the :mod:`io` module for details.
|
||||
Therefore, you **must** make a decision of whether a file will be used for
|
||||
binary access (allowing binary data to be read and/or written) or textual access
|
||||
(allowing text data to be read and/or written). You should also use :func:`io.open`
|
||||
for opening files instead of the built-in :func:`open` function as the :mod:`io`
|
||||
module is consistent from Python 2 to 3 while the built-in :func:`open` function
|
||||
is not (in Python 3 it's actually :func:`io.open`). Do not bother with the
|
||||
outdated practice of using :func:`codecs.open` as that's only necessary for
|
||||
keeping compatibility with Python 2.5.
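
As a sketch of what that decision looks like in practice, the snippet below writes and reads the same file once as text and once as binary data. The temporary directory, the file name and the UTF-8 encoding are assumptions picked for the example.

```python
import io
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()                 # throwaway location for the demo
path = os.path.join(workdir, "example.txt")

# Textual access: name an encoding and read/write text objects.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"na\xefve\n")

with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary access: add ``b`` to the mode and get undecoded bytes back.
with io.open(path, "rb") as f:
    raw = f.read()

shutil.rmtree(workdir)
```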
|
||||
|
||||
The constructors of both ``str`` and ``bytes`` have different semantics for the
|
||||
same arguments between Python 2 & 3. Passing an integer to ``bytes`` in Python 2
|
||||
will give you the string representation of the integer: ``bytes(3) == '3'``.
|
||||
But in Python 3, an integer argument to ``bytes`` will give you a bytes object
|
||||
as long as the integer specified, filled with null bytes:
|
||||
``bytes(3) == b'\x00\x00\x00'``. A similar worry is necessary when passing a
|
||||
bytes object to ``str``. In Python 2 you just get the bytes object back:
|
||||
``str(b'3') == b'3'``. But in Python 3 you get the string representation of the
|
||||
bytes object: ``str(b'3') == "b'3'"``.
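
One way to sidestep both constructors is to spell out the conversion you mean. The following is only a sketch of spellings that behave the same on Python 2.7 and Python 3; ASCII is an assumed encoding for the example.

```python
digits = str(3).encode("ascii")   # the digits of an integer as binary data
padding = b"\x00" * 3             # three null bytes, without calling bytes(3)
text = b"3".decode("ascii")       # binary data to text, never "b'3'"
```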
|
||||
|
||||
Finally, the indexing of binary data requires careful handling (slicing does
|
||||
**not** require any special handling). In Python 2,
|
||||
``b'123'[1] == b'2'`` while in Python 3 ``b'123'[1] == 50``. Because binary data
|
||||
is simply a collection of binary numbers, Python 3 returns the integer value for
|
||||
the byte you index on. But in Python 2 because ``bytes == str``, indexing
|
||||
returns a one-item slice of bytes. The six_ project has a function
|
||||
named ``six.indexbytes()`` which will return an integer like in Python 3:
|
||||
``six.indexbytes(b'123', 1)``.
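
If you would rather not depend on six_ for this, the sketch below shows a few standard-library spellings that give the same answer on Python 2.7 and Python 3 (the variable names are illustrative only):

```python
data = b"123"

one_byte = data[1:2]            # slicing: a length-1 bytes object on both
as_int = ord(data[1:2])         # the integer value of that byte on both
also_int = bytearray(data)[1]   # bytearray indexing yields an int on both
```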
|
||||
|
||||
To summarize:
|
||||
|
||||
#. Decide which of your APIs take text and which take binary data
#. Make sure that your code that works with text also works with ``unicode`` and
   code for binary data works with ``bytes`` in Python 2 (see the table above
   for what methods you cannot use for each type)
#. Mark all binary literals with a ``b`` prefix, textual literals with a ``u``
   prefix
#. Decode binary data to text as soon as possible, encode text as binary data as
   late as possible
#. Open files using :func:`io.open` and make sure to specify the ``b`` mode when
   appropriate
#. Be careful when indexing into binary data
|
||||
|
||||
|
||||
Use feature detection instead of version detection
|
||||
++++++++++++++++++++++++++++++++++++++++++++++++++
|
||||
|
||||
Inevitably you will have code that has to choose what to do based on what
|
||||
version of Python is running. The best way to do this is with feature detection
|
||||
of whether the version of Python you're running under supports what you need.
|
||||
If for some reason that doesn't work then you should make the version check be
|
||||
against Python 2 and not Python 3. To help explain this, let's look at an
|
||||
example.
|
||||
|
||||
Let's pretend that you need access to a feature of importlib_ that
|
||||
is available in Python's standard library since Python 3.3 and available for
|
||||
Python 2 through importlib2_ on PyPI. You might be tempted to write code to
|
||||
access e.g. the ``importlib.abc`` module by doing the following::
|
||||
|
||||
    import sys

    if sys.version_info[0] == 3:
        from importlib import abc
    else:
        from importlib2 import abc
|
||||
|
||||
The problem with this code is: what happens when Python 4 comes out? It would
|
||||
be better to treat Python 2 as the exceptional case instead of Python 3 and
|
||||
assume that future Python versions will be more compatible with Python 3 than
|
||||
Python 2::
|
||||
|
||||
    import sys

    if sys.version_info[0] > 2:
        from importlib import abc
    else:
        from importlib2 import abc
|
||||
|
||||
The best solution, though, is to do no version detection at all and instead rely
|
||||
on feature detection. That avoids any potential issues of getting the version
|
||||
detection wrong and helps keep you future-compatible::
|
||||
|
||||
    try:
        from importlib import abc
    except ImportError:
        from importlib2 import abc
|
||||
|
||||
|
||||
Prevent compatibility regressions
|
||||
---------------------------------
|
||||
|
||||
Once you have fully translated your code to be compatible with Python 3, you
|
||||
will want to make sure your code doesn't regress and stop working under
|
||||
Python 3. This is especially true if you have a dependency which is blocking you
|
||||
from actually running under Python 3 at the moment.
|
||||
|
||||
To help with staying compatible, any new modules you create should have
|
||||
at least the following block of code at the top::
|
||||
|
||||
    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function
|
||||
|
||||
You can also run Python 2 with the ``-3`` flag to be warned about various
|
||||
compatibility issues your code triggers during execution. If you turn warnings
|
||||
into errors with ``-Werror`` then you can make sure that you don't accidentally
|
||||
miss a warning.
|
||||
|
||||
You can also use the Pylint_ project and its ``--py3k`` flag to lint your code
|
||||
to receive warnings when your code begins to deviate from Python 3
|
||||
compatibility. This also prevents you from having to run Modernize_ or Futurize_
|
||||
over your code regularly to catch compatibility regressions. This does require
|
||||
you only support Python 2.7 and Python 3.4 or newer as that is Pylint's
|
||||
minimum Python version support.
|
||||
|
||||
|
||||
Check which dependencies block your transition
|
||||
----------------------------------------------
|
||||
|
||||
**After** you have made your code compatible with Python 3 you should begin to
|
||||
care about whether your dependencies have also been ported. The caniusepython3_
|
||||
project was created to help you determine which projects
|
||||
-- directly or indirectly -- are blocking you from supporting Python 3. There
|
||||
is both a command-line tool and a web interface at
|
||||
https://caniusepython3.com.
|
||||
|
||||
The project also provides code which you can integrate into your test suite so
|
||||
that you will have a failing test when you no longer have dependencies blocking
|
||||
you from using Python 3. This allows you to avoid having to manually check your
|
||||
dependencies and to be notified quickly when you can start running on Python 3.
|
||||
|
||||
|
||||
Update your ``setup.py`` file to denote Python 3 compatibility
|
||||
--------------------------------------------------------------
|
||||
|
||||
Once your code works under Python 3, you should update the classifiers in
|
||||
your ``setup.py`` to contain ``Programming Language :: Python :: 3`` and to not
|
||||
specify sole Python 2 support. This will tell anyone using your code that you
|
||||
support Python 2 **and** 3. Ideally you will also want to add classifiers for
|
||||
each major/minor version of Python you now support.
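
As a hedged sketch, the classifier list might end up looking like this. The exact minor versions are assumptions for the example; list only the versions you actually test against. The list would be passed as the ``classifiers`` argument of ``setup()``.

```python
# Hypothetical excerpt from a setup.py; pass as ``classifiers=CLASSIFIERS``.
CLASSIFIERS = [
    "Programming Language :: Python :: 2",
    "Programming Language :: Python :: 2.7",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.4",
    "Programming Language :: Python :: 3.5",
]
```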
|
||||
|
||||
|
||||
Use continuous integration to stay compatible
|
||||
---------------------------------------------
|
||||
|
||||
Once you are able to fully run under Python 3 you will want to make sure your
|
||||
code always works under both Python 2 & 3. Probably the best tool for running
|
||||
your tests under multiple Python interpreters is tox_. You can then integrate
|
||||
tox with your continuous integration system so that you never accidentally break
|
||||
Python 2 or 3 support.
|
||||
|
||||
You may also want to use the ``-bb`` flag with the Python 3 interpreter to
|
||||
trigger an exception when you are comparing bytes to strings or bytes to an int
|
||||
(the latter is available starting in Python 3.5). By default type-differing
|
||||
comparisons simply return ``False``, but if you made a mistake in your
|
||||
separation of text/binary data handling or indexing on bytes you wouldn't easily
|
||||
find the mistake. This flag will raise an exception when these kinds of
|
||||
comparisons occur, making the mistake much easier to track down.
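
To make the failure mode concrete, here is a small sketch of the comparisons in question as they behave on Python 3 when the flag is not given. The variable names are invented for the example; with ``-bb`` the first two comparisons would raise ``BytesWarning`` instead of quietly producing ``False``.

```python
payload = b"abc"

# Both of these are simply False on Python 3 (and True on Python 2),
# which is exactly the kind of silent difference that is hard to spot.
text_compare = (payload == u"abc")      # bytes compared to text
index_compare = (payload[0] == b"a")    # an int compared to bytes

# Comparing like with like gives the same answer on both versions.
slice_compare = (payload[0:1] == b"a")
```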
|
||||
|
||||
And that's mostly it! At this point your code base is compatible with both
|
||||
Python 2 and 3 simultaneously. Your testing will also be set up so that you
|
||||
don't accidentally break Python 2 or 3 compatibility regardless of which version
|
||||
you typically run your tests under while developing.
|
||||
|
||||
|
||||
Consider using optional static type checking
|
||||
--------------------------------------------
|
||||
|
||||
Another way to help port your code is to use a static type checker like
|
||||
mypy_ or pytype_ on your code. These tools can be used to analyze your code as
|
||||
if it's being run under Python 2, then you can run the tool a second time as if
|
||||
your code is running under Python 3. By running a static type checker twice like
|
||||
this you can discover if you're e.g. misusing binary data type in one version
|
||||
of Python compared to another. If you add optional type hints to your code you
|
||||
can also explicitly state whether your APIs use textual or binary data, helping
|
||||
to make sure everything functions as expected in both versions of Python.
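
A sketch of what such a hint can look like, using the comment-based syntax from PEP 484 so that the file still parses under Python 2. The ``decode_header`` function is hypothetical, and the ``typing`` import is an assumption: it is in the standard library on Python 3.5 and newer, but would have to be installed separately on Python 2.

```python
from typing import Text


def decode_header(raw, encoding="ascii"):
    # type: (bytes, str) -> Text
    """Hypothetical API that accepts only binary data and returns text."""
    return raw.decode(encoding)


header = decode_header(b"Content-Length")
```

A type checker reading that comment can flag a caller that passes text where binary data is expected, without the code having to run.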
|
||||
|
||||
|
||||
.. _2to3: https://docs.python.org/3/library/2to3.html
|
||||
.. _caniusepython3: https://pypi.org/project/caniusepython3
|
||||
.. _cheat sheet: http://python-future.org/compatible_idioms.html
|
||||
.. _coverage.py: https://pypi.org/project/coverage
|
||||
.. _Futurize: http://python-future.org/automatic_conversion.html
|
||||
.. _importlib: https://docs.python.org/3/library/importlib.html#module-importlib
|
||||
.. _importlib2: https://pypi.org/project/importlib2
|
||||
.. _Modernize: https://python-modernize.readthedocs.org/en/latest/
|
||||
.. _mypy: http://mypy-lang.org/
|
||||
.. _Porting to Python 3: http://python3porting.com/
|
||||
.. _Pylint: https://pypi.org/project/pylint
|
||||
|
||||
.. _Python 3 Q & A: https://ncoghlan-devs-python-notes.readthedocs.org/en/latest/python3/questions_and_answers.html
|
||||
|
||||
.. _pytype: https://github.com/google/pytype
|
||||
.. _python-future: http://python-future.org/
|
||||
.. _python-porting: https://mail.python.org/mailman/listinfo/python-porting
|
||||
.. _six: https://pypi.org/project/six
|
||||
.. _tox: https://pypi.org/project/tox
|
||||
.. _trove classifier: https://pypi.org/classifiers
|
||||
|
||||
.. _"What's New": https://docs.python.org/3/whatsnew/index.html
|
||||
|
||||
.. _Why Python 3 exists: http://www.snarky.ca/why-python-3-exists
|
||||
1379
Doc/howto/regex.rst
Normal file
File diff suppressed because it is too large
418
Doc/howto/sockets.rst
Normal file
@@ -0,0 +1,418 @@
|
||||
.. _socket-howto:
|
||||
|
||||
****************************
|
||||
Socket Programming HOWTO
|
||||
****************************
|
||||
|
||||
:Author: Gordon McMillan
|
||||
|
||||
|
||||
.. topic:: Abstract
|
||||
|
||||
   Sockets are used nearly everywhere, but are one of the most severely
   misunderstood technologies around. This is a 10,000 foot overview of sockets.
   It's not really a tutorial - you'll still have work to do in getting things
   operational. It doesn't cover the fine points (and there are a lot of them), but
   I hope it will give you enough background to begin using them decently.
|
||||
|
||||
|
||||
Sockets
|
||||
=======
|
||||
|
||||
I'm only going to talk about INET sockets, but they account for at least 99% of
|
||||
the sockets in use. And I'll only talk about STREAM sockets - unless you really
|
||||
know what you're doing (in which case this HOWTO isn't for you!), you'll get
|
||||
better behavior and performance from a STREAM socket than anything else. I will
|
||||
try to clear up the mystery of what a socket is, as well as some hints on how to
|
||||
work with blocking and non-blocking sockets. But I'll start by talking about
|
||||
blocking sockets. You'll need to know how they work before dealing with
|
||||
non-blocking sockets.
|
||||
|
||||
Part of the trouble with understanding these things is that "socket" can mean a
|
||||
number of subtly different things, depending on context. So first, let's make a
|
||||
distinction between a "client" socket - an endpoint of a conversation, and a
|
||||
"server" socket, which is more like a switchboard operator. The client
|
||||
application (your browser, for example) uses "client" sockets exclusively; the
|
||||
web server it's talking to uses both "server" sockets and "client" sockets.
|
||||
|
||||
|
||||
History
|
||||
-------
|
||||
|
||||
Of the various forms of :abbr:`IPC (Inter Process Communication)`,
|
||||
sockets are by far the most popular. On any given platform, there are
|
||||
likely to be other forms of IPC that are faster, but for
|
||||
cross-platform communication, sockets are about the only game in town.
|
||||
|
||||
They were invented in Berkeley as part of the BSD flavor of Unix. They spread
|
||||
like wildfire with the Internet. With good reason --- the combination of sockets
|
||||
with INET makes talking to arbitrary machines around the world unbelievably easy
|
||||
(at least compared to other schemes).
|
||||
|
||||
|
||||
Creating a Socket
|
||||
=================
|
||||
|
||||
Roughly speaking, when you clicked on the link that brought you to this page,
|
||||
your browser did something like the following::
|
||||
|
||||
    import socket

    #create an INET, STREAMing socket
    s = socket.socket(
        socket.AF_INET, socket.SOCK_STREAM)
    #now connect to the web server on port 80
    # - the normal http port
    s.connect(("www.mcmillan-inc.com", 80))
|
||||
|
||||
When the ``connect`` completes, the socket ``s`` can be used to send
|
||||
in a request for the text of the page. The same socket will read the
|
||||
reply, and then be destroyed. That's right, destroyed. Client sockets
|
||||
are normally only used for one exchange (or a small set of sequential
|
||||
exchanges).
|
||||
|
||||
What happens in the web server is a bit more complex. First, the web server
|
||||
creates a "server socket"::
|
||||
|
||||
    #create an INET, STREAMing socket
    serversocket = socket.socket(
        socket.AF_INET, socket.SOCK_STREAM)
    #bind the socket to a public host,
    # and a well-known port
    serversocket.bind((socket.gethostname(), 80))
    #become a server socket
    serversocket.listen(5)
|
||||
|
||||
A couple things to notice: we used ``socket.gethostname()`` so that the socket
|
||||
would be visible to the outside world. If we had used ``s.bind(('localhost',
|
||||
80))`` or ``s.bind(('127.0.0.1', 80))`` we would still have a "server" socket,
|
||||
but one that was only visible within the same machine. ``s.bind(('', 80))``
|
||||
specifies that the socket is reachable by any address the machine happens to
|
||||
have.
|
||||
|
||||
A second thing to note: low number ports are usually reserved for "well known"
|
||||
services (HTTP, SNMP etc). If you're playing around, use a nice high number (4
|
||||
digits).
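
If you just want to experiment, a sketch like the following keeps the socket on the loopback interface and lets the operating system pick a free high port, by asking for port ``0``. The variable names are mine, not part of any API.

```python
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # loopback only; port 0 = let the OS choose
server.listen(5)

host, port = server.getsockname()    # find out which port was assigned
server.close()
```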
|
||||
|
||||
Finally, the argument to ``listen`` tells the socket library that we want it to
|
||||
queue up as many as 5 connect requests (the normal max) before refusing outside
|
||||
connections. If the rest of the code is written properly, that should be plenty.
|
||||
|
||||
Now that we have a "server" socket, listening on port 80, we can enter the
|
||||
mainloop of the web server::
|
||||
|
||||
    while 1:
        #accept connections from outside
        (clientsocket, address) = serversocket.accept()
        #now do something with the clientsocket
        #in this case, we'll pretend this is a threaded server
        ct = client_thread(clientsocket)
        ct.run()
|
||||
|
||||
There are actually 3 general ways in which this loop could work - dispatching a
thread to handle ``clientsocket``, creating a new process to handle
``clientsocket``, or restructuring this app to use non-blocking sockets, and
|
||||
multiplex between our "server" socket and any active ``clientsocket``\ s using
|
||||
``select``. More about that later. The important thing to understand now is
|
||||
this: this is *all* a "server" socket does. It doesn't send any data. It doesn't
|
||||
receive any data. It just produces "client" sockets. Each ``clientsocket`` is
|
||||
created in response to some *other* "client" socket doing a ``connect()`` to the
|
||||
host and port we're bound to. As soon as we've created that ``clientsocket``, we
|
||||
go back to listening for more connections. The two "clients" are free to chat it
|
||||
up - they are using some dynamically allocated port which will be recycled when
|
||||
the conversation ends.
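
To make the thread-per-connection variant concrete, here is a self-contained sketch that plays both roles on the loopback interface. The ``handle`` and ``accept_one`` functions are invented for this example, the server accepts a single connection instead of looping forever, and it echoes data back only so that there is something to observe.

```python
import socket
import threading


def handle(clientsocket):
    """Echo whatever arrives until the peer hangs up."""
    while True:
        data = clientsocket.recv(1024)
        if not data:              # zero bytes: the other side closed
            break
        clientsocket.sendall(data)
    clientsocket.close()


serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
serversocket.bind(("127.0.0.1", 0))   # loopback only, OS-chosen port
serversocket.listen(5)
port = serversocket.getsockname()[1]


def accept_one():
    # The "server" socket does nothing but hand out "client" sockets.
    clientsocket, address = serversocket.accept()
    worker = threading.Thread(target=handle, args=(clientsocket,))
    worker.start()
    # A real server would go straight back to accept() here; the demo
    # waits so that it can shut down cleanly.
    worker.join()


acceptor = threading.Thread(target=accept_one)
acceptor.start()

# Play the part of the remote "client" socket.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
client.sendall(b"ping")
reply = b""
while len(reply) < 4:
    chunk = client.recv(4 - len(reply))
    if not chunk:
        break
    reply += chunk
client.close()

acceptor.join()
serversocket.close()
```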
|
||||
|
||||
|
||||
IPC
|
||||
---
|
||||
|
||||
If you need fast IPC between two processes on one machine, you should look into
|
||||
whatever form of shared memory the platform offers. A simple protocol based
|
||||
around shared memory and locks or semaphores is by far the fastest technique.
|
||||
|
||||
If you do decide to use sockets, bind the "server" socket to ``'localhost'``. On
|
||||
most platforms, this will take a shortcut around a couple of layers of network
|
||||
code and be quite a bit faster.
|
||||
|
||||
|
||||
Using a Socket
|
||||
==============
|
||||
|
||||
The first thing to note is that the web browser's "client" socket and the web
|
||||
server's "client" socket are identical beasts. That is, this is a "peer to peer"
|
||||
conversation. Or to put it another way, *as the designer, you will have to
|
||||
decide what the rules of etiquette are for a conversation*. Normally, the
|
||||
``connect``\ ing socket starts the conversation, by sending in a request, or
|
||||
perhaps a signon. But that's a design decision - it's not a rule of sockets.
|
||||
|
||||
Now there are two sets of verbs to use for communication. You can use ``send``
|
||||
and ``recv``, or you can transform your client socket into a file-like beast and
|
||||
use ``read`` and ``write``. The latter is the way Java presents its sockets.
|
||||
I'm not going to talk about it here, except to warn you that you need to use
|
||||
``flush`` on sockets. These are buffered "files", and a common mistake is to
|
||||
``write`` something, and then ``read`` for a reply. Without a ``flush`` in
|
||||
there, you may wait forever for the reply, because the request may still be in
|
||||
your output buffer.
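
For the curious, a sketch of the file-like approach, assuming the ``makefile()`` method of socket objects and a connected pair from ``socket.socketpair()`` (available on Unix, and on Windows only in newer Python 3 releases) standing in for a real network connection:

```python
import socket

left, right = socket.socketpair()    # two already-connected stream sockets
writer = left.makefile("wb")
reader = right.makefile("rb")

writer.write(b"HELLO\n")
writer.flush()                       # without this the bytes may sit in the buffer
line = reader.readline()

writer.close()
reader.close()
left.close()
right.close()
```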
|
||||
|
||||
Now we come to the major stumbling block of sockets - ``send`` and ``recv`` operate
|
||||
on the network buffers. They do not necessarily handle all the bytes you hand
|
||||
them (or expect from them), because their major focus is handling the network
|
||||
buffers. In general, they return when the associated network buffers have been
|
||||
filled (``send``) or emptied (``recv``). They then tell you how many bytes they
|
||||
handled. It is *your* responsibility to call them again until your message has
|
||||
been completely dealt with.
|
||||
|
||||
When a ``recv`` returns 0 bytes, it means the other side has closed (or is in
|
||||
the process of closing) the connection. You will not receive any more data on
|
||||
this connection. Ever. You may be able to send data successfully; I'll talk
|
||||
more about this later.
|
||||
|
||||
A protocol like HTTP uses a socket for only one transfer. The client sends a
|
||||
request, then reads a reply. That's it. The socket is discarded. This means that
|
||||
a client can detect the end of the reply by receiving 0 bytes.
|
||||
|
||||
But if you plan to reuse your socket for further transfers, you need to realize
|
||||
that *there is no* :abbr:`EOT (End of Transfer)` *on a socket.* I repeat: if a socket
|
||||
``send`` or ``recv`` returns after handling 0 bytes, the connection has been
|
||||
broken. If the connection has *not* been broken, you may wait on a ``recv``
|
||||
forever, because the socket will *not* tell you that there's nothing more to
|
||||
read (for now). Now if you think about that a bit, you'll come to realize a
|
||||
fundamental truth of sockets: *messages must either be fixed length* (yuck), *or
|
||||
be delimited* (shrug), *or indicate how long they are* (much better), *or end by
|
||||
shutting down the connection*. The choice is entirely yours (but some ways are
|
||||
righter than others).
|
||||
|
||||
Assuming you don't want to end the connection, the simplest solution is a fixed
|
||||
length message::
|
||||
|
||||
    import socket

    # MSGLEN is the fixed message length both ends have agreed on;
    # it must be defined before this class is used.

    class mysocket:
        '''demonstration class only
          - coded for clarity, not efficiency
        '''

        def __init__(self, sock=None):
            if sock is None:
                self.sock = socket.socket(
                    socket.AF_INET, socket.SOCK_STREAM)
            else:
                self.sock = sock

        def connect(self, host, port):
            self.sock.connect((host, port))

        def mysend(self, msg):
            totalsent = 0
            while totalsent < MSGLEN:
                # send() may accept only part of the data; resend the rest
                sent = self.sock.send(msg[totalsent:])
                if sent == 0:
                    raise RuntimeError("socket connection broken")
                totalsent = totalsent + sent

        def myreceive(self):
            chunks = []
            bytes_recd = 0
            while bytes_recd < MSGLEN:
                # never ask for more than the rest of this message
                chunk = self.sock.recv(min(MSGLEN - bytes_recd, 2048))
                if chunk == '':
                    raise RuntimeError("socket connection broken")
                chunks.append(chunk)
                bytes_recd = bytes_recd + len(chunk)
            return ''.join(chunks)
|
||||
|
||||
The sending code here is usable for almost any messaging scheme - in Python you
|
||||
send strings, and you can use ``len()`` to determine its length (even if it has
|
||||
embedded ``\0`` characters). It's mostly the receiving code that gets more
|
||||
complex. (And in C, it's not much worse, except you can't use ``strlen`` if the
|
||||
message has embedded ``\0``\ s.)
|
||||
|
||||
The easiest enhancement is to make the first character of the message an
|
||||
indicator of message type, and have the type determine the length. Now you have
|
||||
two ``recv``\ s - the first to get (at least) that first character so you can
|
||||
look up the length, and the second in a loop to get the rest. If you decide to
|
||||
go the delimited route, you'll be receiving in some arbitrary chunk size, (4096
|
||||
or 8192 is frequently a good match for network buffer sizes), and scanning what
|
||||
you've received for a delimiter.
|
||||
|
||||
One complication to be aware of: if your conversational protocol allows multiple
|
||||
messages to be sent back to back (without some kind of reply), and you pass
|
||||
``recv`` an arbitrary chunk size, you may end up reading the start of a
|
||||
following message. You'll need to put that aside and hold onto it, until it's
|
||||
needed.
|
||||
|
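Here is one way the delimited route might look, including the business of holding onto surplus bytes. This is a sketch only: the ``LineReader`` class, the newline delimiter and the use of ``socket.socketpair()`` for the demonstration are all assumptions made for illustration.

```python
import socket

DELIMITER = b'\n'   # hypothetical choice of delimiter


class LineReader(object):
    """Read delimiter-terminated messages, keeping any surplus bytes."""

    def __init__(self, sock):
        self.sock = sock
        self.pending = b''      # bytes received but not yet consumed

    def next_message(self):
        # Keep receiving until the buffer holds at least one full message.
        while DELIMITER not in self.pending:
            chunk = self.sock.recv(4096)
            if not chunk:
                raise RuntimeError("socket connection broken")
            self.pending += chunk
        # Hand back one message; whatever follows stays in self.pending.
        message, _, self.pending = self.pending.partition(DELIMITER)
        return message


# Demonstration on a connected pair of local sockets.
a, b = socket.socketpair()
a.sendall(b'first\nsecond\n')       # two messages sent back to back
reader = LineReader(b)
first = reader.next_message()
second = reader.next_message()
a.close()
b.close()
```

The second call usually needs no ``recv`` at all, because the start of the following message was already read and put aside.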
||||
Prefixing the message with its length (say, as 5 numeric characters) gets more
|
||||
complex, because (believe it or not), you may not get all 5 characters in one
|
||||
``recv``. In playing around, you'll get away with it; but in high network loads,
|
||||
your code will very quickly break unless you use two ``recv`` loops - the first
|
||||
to determine the length, the second to get the data part of the message. Nasty.
|
||||
This is also when you'll discover that ``send`` does not always manage to get
|
||||
rid of everything in one pass. And despite having read this, you will eventually
|
||||
get bitten by it!
|
||||
|
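As a rough sketch of the two-loop approach, assuming a 5-digit ASCII length prefix as in the paragraph above (the helper names ``recv_exactly``, ``send_message`` and ``recv_message`` are made up for this example, and ``socket.socketpair()`` stands in for a real connection):

```python
import socket


def recv_exactly(sock, count):
    """Loop on recv() until exactly `count` bytes have arrived."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise RuntimeError("socket connection broken")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b''.join(chunks)


def send_message(sock, payload):
    # 5 ASCII digits of length, followed by the payload itself.
    header = ('%05d' % len(payload)).encode('ascii')
    # sendall() keeps calling send() until everything has been handed over.
    sock.sendall(header + payload)


def recv_message(sock):
    length = int(recv_exactly(sock, 5))      # first loop: the length
    return recv_exactly(sock, length)        # second loop: the data


a, b = socket.socketpair()
send_message(a, b'hello world')
received = recv_message(b)
a.close()
b.close()
```

The same ``recv_exactly`` loop serves for both the prefix and the body, which is why this scheme ends up less nasty than it first sounds.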
||||
In the interests of space, building your character, (and preserving my
|
||||
competitive position), these enhancements are left as an exercise for the
|
||||
reader. Let's move on to cleaning up.
|
||||
|
||||
|
||||
Binary Data
|
||||
-----------
|
||||
|
||||
It is perfectly possible to send binary data over a socket. The major problem is
|
||||
that not all machines use the same formats for binary data. For example, a
|
||||
Motorola chip will represent a 16 bit integer with the value 1 as the two hex
|
||||
bytes 00 01. Intel and DEC, however, are byte-reversed - that same 1 is 01 00.
|
||||
Socket libraries have calls for converting 16 and 32 bit integers - ``ntohl,
|
||||
htonl, ntohs, htons`` where "n" means *network* and "h" means *host*, "s" means
|
||||
*short* and "l" means *long*. Where network order is host order, these do
|
||||
nothing, but where the machine is byte-reversed, these swap the bytes around
|
||||
appropriately.
|
||||
|
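To make the byte-order issue concrete, here is a small sketch using the standard :mod:`struct` and :mod:`socket` modules; the variable names are illustrative only.

```python
import socket
import struct

value = 1

# '!' selects network (big-endian) byte order; 'H' is an unsigned 16-bit int.
network_bytes = struct.pack('!H', value)         # b'\x00\x01'
# '<' forces little-endian, the layout an Intel machine uses natively.
little_endian_bytes = struct.pack('<H', value)   # b'\x01\x00'

# htons() converts host order to network order and ntohs() converts back,
# so applying one after the other returns the original value.
round_trip = socket.ntohs(socket.htons(value))
```

Packing with an explicit ``'!'`` format is one way to avoid caring what the host order is at all.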
||||
In these days of 32 bit machines, the ascii representation of binary data is
|
||||
frequently smaller than the binary representation. That's because a surprising
|
||||
amount of the time, all those longs have the value 0, or maybe 1. The string "0"
|
||||
would be two bytes (the digit plus a terminator), while binary is four. Of course, this doesn't fit well with
|
||||
fixed-length messages. Decisions, decisions.
|
||||
|
||||
|
||||
Disconnecting
|
||||
=============
|
||||
|
||||
Strictly speaking, you're supposed to use ``shutdown`` on a socket before you
|
||||
``close`` it. The ``shutdown`` is an advisory to the socket at the other end.
|
||||
Depending on the argument you pass it, it can mean "I'm not going to send
|
||||
anymore, but I'll still listen", or "I'm not listening, good riddance!". Most
|
||||
socket libraries, however, are so used to programmers neglecting to use this
|
||||
piece of etiquette that normally a ``close`` is the same as ``shutdown();
|
||||
close()``. So in most situations, an explicit ``shutdown`` is not needed.
|
||||
|
||||
One way to use ``shutdown`` effectively is in an HTTP-like exchange. The client
|
||||
sends a request and then does a ``shutdown(1)`` (that is, ``socket.SHUT_WR``). This tells the server "This
|
||||
client is done sending, but can still receive." The server can detect "EOF" by
|
||||
a receive of 0 bytes. It can assume it has the complete request. The server
|
||||
sends a reply. If the ``send`` completes successfully then, indeed, the client
|
||||
was still receiving.
|
||||
|
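The exchange just described might be sketched like this. It assumes ``socket.socketpair()`` in place of a real client and server, and the request text is invented for the example.

```python
import socket

client, server = socket.socketpair()

client.sendall(b'GET /something')
client.shutdown(socket.SHUT_WR)      # same as shutdown(1): done sending

# Server side: read until recv() returns nothing, which is the "EOF".
pieces = []
while True:
    data = server.recv(1024)
    if not data:
        break
    pieces.append(data)
request = b''.join(pieces)

server.sendall(b'reply to ' + request)
server.close()

# The client can still receive after its own shutdown.
reply = b''
while True:
    data = client.recv(1024)
    if not data:
        break
    reply += data
client.close()
```

Note that the client only gave up its sending half; its receiving half keeps working until the server closes.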
||||
Python takes the automatic shutdown a step further, and says that when a socket
|
||||
is garbage collected, it will automatically do a ``close`` if it's needed. But
|
||||
relying on this is a very bad habit. If your socket just disappears without
|
||||
doing a ``close``, the socket at the other end may hang indefinitely, thinking
|
||||
you're just being slow. *Please* ``close`` your sockets when you're done.
|
||||
|
||||
|
||||
When Sockets Die
|
||||
----------------
|
||||
|
||||
Probably the worst thing about using blocking sockets is what happens when the
|
||||
other side comes down hard (without doing a ``close``). Your socket is likely to
|
||||
hang. ``SOCK_STREAM`` (TCP) is a reliable protocol, and it will wait a long, long time
|
||||
before giving up on a connection. If you're using threads, the entire thread is
|
||||
essentially dead. There's not much you can do about it. As long as you aren't
|
||||
doing something dumb, like holding a lock while doing a blocking read, the
|
||||
thread isn't really consuming much in the way of resources. Do *not* try to kill
|
||||
the thread - part of the reason that threads are more efficient than processes
|
||||
is that they avoid the overhead associated with the automatic recycling of
|
||||
resources. In other words, if you do manage to kill the thread, your whole
|
||||
process is likely to be screwed up.
|
||||
|
||||
|
||||
Non-blocking Sockets
|
||||
====================
|
||||
|
||||
If you've understood the preceding, you already know most of what you need to
|
||||
know about the mechanics of using sockets. You'll still use the same calls, in
|
||||
much the same ways. It's just that, if you do it right, your app will be almost
|
||||
inside-out.
|
||||
|
||||
In Python, you use ``socket.setblocking(0)`` to make it non-blocking. In C, it's
|
||||
more complex, (for one thing, you'll need to choose between the BSD flavor
|
||||
``O_NONBLOCK`` and the almost indistinguishable Posix flavor ``O_NDELAY``, which
|
||||
is completely different from ``TCP_NODELAY``), but it's the exact same idea. You
|
||||
do this after creating the socket, but before using it. (Actually, if you're
|
||||
nuts, you can switch back and forth.)
|
||||
|
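A small sketch of what "non-blocking" means in practice, assuming a local ``socket.socketpair()`` on which nothing has been sent yet:

```python
import errno
import socket

a, b = socket.socketpair()
b.setblocking(0)            # make b non-blocking before using it

try:
    b.recv(1024)            # nothing has been sent, so there is no data
    would_block = False
except socket.error as exc:
    # A non-blocking recv() with no data fails at once instead of waiting.
    would_block = exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK)

a.close()
b.close()
```

A blocking socket would simply have sat in ``recv`` at this point.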
||||
The major mechanical difference is that ``send``, ``recv``, ``connect`` and
|
||||
``accept`` can return without having done anything. You have (of course) a
|
||||
number of choices. You can check return code and error codes and generally drive
|
||||
yourself crazy. If you don't believe me, try it sometime. Your app will grow
|
||||
large, buggy and suck CPU. So let's skip the brain-dead solutions and do it
|
||||
right.
|
||||
|
||||
Use ``select``.
|
||||
|
||||
In C, coding ``select`` is fairly complex. In Python, it's a piece of cake, but
|
||||
it's close enough to the C version that if you understand ``select`` in Python,
|
||||
you'll have little trouble with it in C::
|
||||
|
||||
    ready_to_read, ready_to_write, in_error = \
        select.select(
            potential_readers,
            potential_writers,
            potential_errs,
            timeout)
|
||||
|
||||
You pass ``select`` three lists: the first contains all sockets that you might
|
||||
want to try reading; the second all the sockets you might want to try writing
|
||||
to, and the last (normally left empty) those that you want to check for errors.
|
||||
You should note that a socket can go into more than one list. The ``select``
|
||||
call is blocking, but you can give it a timeout. This is generally a sensible
|
||||
thing to do - give it a nice long timeout (say a minute) unless you have good
|
||||
reason to do otherwise.
|
||||
|
||||
In return, you will get three lists. They contain the sockets that are actually
|
||||
readable, writable and in error. Each of these lists is a subset (possibly
|
||||
empty) of the corresponding list you passed in.
|
||||
|
||||
If a socket is in the output readable list, you can be
|
||||
as-close-to-certain-as-we-ever-get-in-this-business that a ``recv`` on that
|
||||
socket will return *something*. Same idea for the writable list. You'll be able
|
||||
to send *something*. Maybe not all you want to, but *something* is better than
|
||||
nothing. (Actually, any reasonably healthy socket will return as writable - it
|
||||
just means outbound network buffer space is available.)
|
||||
|
||||
If you have a "server" socket, put it in the potential_readers list. If it comes
|
||||
out in the readable list, your ``accept`` will (almost certainly) work. If you
|
||||
have created a new socket to ``connect`` to someone else, put it in the
|
||||
potential_writers list. If it shows up in the writable list, you have a decent
|
||||
chance that it has connected.
|
||||
|
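Putting those pieces together, a minimal sketch might look like the following. The use of ``'localhost'`` with port 0 (let the OS pick a free port) and the 60 second timeouts are assumptions made for the demonstration.

```python
import select
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('localhost', 0))        # port 0: let the OS choose a free port
server.listen(5)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())

# The "server" socket goes in the potential readers list; it comes back
# as readable once a connection is waiting to be accepted.
readable, _, _ = select.select([server], [], [], 60)
server_ready = server in readable
conn, addr = server.accept()

# A healthy connected socket reports as writable.
_, writable, _ = select.select([], [client], [], 60)
client.sendall(b'hi')

# Wait until the accepted socket actually has something to read.
readable, _, _ = select.select([conn], [], [], 60)
data = conn.recv(1024) if conn in readable else b''

for s in (conn, client, server):
    s.close()
```

In a real program the three lists would hold many sockets and the ``select`` call would sit inside a loop.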
||||
One very nasty problem with ``select``: if somewhere in those input lists of
|
||||
sockets is one which has died a nasty death, the ``select`` will fail. You then
|
||||
need to loop through every single damn socket in all those lists and do a
|
||||
``select([sock],[],[],0)`` until you find the bad one. That timeout of 0 means
|
||||
it won't take long, but it's ugly.
|
||||
|
||||
Actually, ``select`` can be handy even with blocking sockets. It's one way of
|
||||
determining whether you will block - the socket returns as readable when there's
|
||||
something in the buffers. However, this still doesn't help with the problem of
|
||||
determining whether the other end is done, or just busy with something else.
|
||||
|
||||
**Portability alert**: On Unix, ``select`` works with both sockets and
|
||||
files. Don't try this on Windows. On Windows, ``select`` works with sockets
|
||||
only. Also note that in C, many of the more advanced socket options are done
|
||||
differently on Windows. In fact, on Windows I usually use threads (which work
|
||||
very, very well) with my sockets. Face it, if you want any kind of performance,
|
||||
your code will look very different on Windows than on Unix.
|
||||
|
||||
|
||||
Performance
|
||||
-----------
|
||||
|
||||
There's no question that the fastest sockets code uses non-blocking sockets and
|
||||
select to multiplex them. You can put together something that will saturate a
|
||||
LAN connection without putting any strain on the CPU. The trouble is that an app
|
||||
written this way can't do much of anything else - it needs to be ready to
|
||||
shuffle bytes around at all times.
|
||||
|
||||
Assuming that your app is actually supposed to do something more than that,
|
||||
threading is the optimal solution, (and using non-blocking sockets will be
|
||||
faster than using blocking sockets). Unfortunately, threading support in Unixes
|
||||
varies both in API and quality. So the normal Unix solution is to fork a
|
||||
subprocess to deal with each connection. The overhead for this is significant
|
||||
(and don't do this on Windows - the overhead of process creation is enormous
|
||||
there). It also means that unless each subprocess is completely independent,
|
||||
you'll need to use another form of IPC, say a pipe, or shared memory and
|
||||
semaphores, to communicate between the parent and child processes.
|
||||
|
||||
Finally, remember that even though blocking sockets are somewhat slower than
|
||||
non-blocking, in many cases they are the "right" solution. After all, if your
|
||||
app is driven by the data it receives over a socket, there's not much sense in
|
||||
complicating the logic just so your app can wait on ``select`` instead of
|
||||
``recv``.
|
||||
|
||||
314
Doc/howto/sorting.rst
Normal file
@@ -0,0 +1,314 @@
|
||||
.. _sortinghowto:
|
||||
|
||||
Sorting HOW TO
|
||||
**************
|
||||
|
||||
:Author: Andrew Dalke and Raymond Hettinger
|
||||
:Release: 0.1
|
||||
|
||||
|
||||
Python lists have a built-in :meth:`list.sort` method that modifies the list
|
||||
in-place. There is also a :func:`sorted` built-in function that builds a new
|
||||
sorted list from an iterable.
|
||||
|
||||
In this document, we explore the various techniques for sorting data using Python.
|
||||
|
||||
|
||||
Sorting Basics
|
||||
==============
|
||||
|
||||
A simple ascending sort is very easy: just call the :func:`sorted` function. It
|
||||
returns a new sorted list::
|
||||
|
||||
>>> sorted([5, 2, 3, 1, 4])
|
||||
[1, 2, 3, 4, 5]
|
||||
|
||||
You can also use the :meth:`list.sort` method of a list. It modifies the list
|
||||
in-place (and returns ``None`` to avoid confusion). Usually it's less convenient
|
||||
than :func:`sorted` - but if you don't need the original list, it's slightly
|
||||
more efficient.
|
||||
|
||||
>>> a = [5, 2, 3, 1, 4]
|
||||
>>> a.sort()
|
||||
>>> a
|
||||
[1, 2, 3, 4, 5]
|
||||
|
||||
Another difference is that the :meth:`list.sort` method is only defined for
|
||||
lists. In contrast, the :func:`sorted` function accepts any iterable.
|
||||
|
||||
>>> sorted({1: 'D', 2: 'B', 3: 'B', 4: 'E', 5: 'A'})
|
||||
[1, 2, 3, 4, 5]
|
||||
|
||||
Key Functions
|
||||
=============
|
||||
|
||||
Starting with Python 2.4, both :meth:`list.sort` and :func:`sorted` added a
|
||||
*key* parameter to specify a function to be called on each list element prior to
|
||||
making comparisons.
|
||||
|
||||
For example, here's a case-insensitive string comparison:
|
||||
|
||||
>>> sorted("This is a test string from Andrew".split(), key=str.lower)
|
||||
['a', 'Andrew', 'from', 'is', 'string', 'test', 'This']
|
||||
|
||||
The value of the *key* parameter should be a function that takes a single argument
|
||||
and returns a key to use for sorting purposes. This technique is fast because
|
||||
the key function is called exactly once for each input record.
|
||||
|
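The "exactly once" claim is easy to check for yourself. In this sketch the ``lowered`` function and the ``calls`` list are illustrative names, not part of any library:

```python
calls = []

def lowered(word):
    calls.append(word)          # record each invocation
    return word.lower()

words = "This is a test string from Andrew".split()
result = sorted(words, key=lowered)
```

After the sort, ``calls`` holds one entry per input word, no matter how many comparisons the sort had to make.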
||||
A common pattern is to sort complex objects using some of the object's indices
|
||||
as keys. For example:
|
||||
|
||||
>>> student_tuples = [
...     ('john', 'A', 15),
...     ('jane', 'B', 12),
...     ('dave', 'B', 10),
... ]
|
||||
>>> sorted(student_tuples, key=lambda student: student[2]) # sort by age
|
||||
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
|
||||
|
||||
The same technique works for objects with named attributes. For example:
|
||||
|
||||
>>> class Student:
...     def __init__(self, name, grade, age):
...         self.name = name
...         self.grade = grade
...         self.age = age
...     def __repr__(self):
...         return repr((self.name, self.grade, self.age))
|
||||
|
||||
>>> student_objects = [
...     Student('john', 'A', 15),
...     Student('jane', 'B', 12),
...     Student('dave', 'B', 10),
... ]
|
||||
>>> sorted(student_objects, key=lambda student: student.age) # sort by age
|
||||
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
|
||||
|
||||
Operator Module Functions
|
||||
=========================
|
||||
|
||||
The key-function patterns shown above are very common, so Python provides
|
||||
convenience functions to make accessor functions easier and faster. The operator
|
||||
module has :func:`operator.itemgetter`, :func:`operator.attrgetter`, and
|
||||
starting in Python 2.6 an :func:`operator.methodcaller` function.
|
||||
|
||||
Using those functions, the above examples become simpler and faster:
|
||||
|
||||
>>> from operator import itemgetter, attrgetter
|
||||
|
||||
>>> sorted(student_tuples, key=itemgetter(2))
|
||||
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
|
||||
|
||||
>>> sorted(student_objects, key=attrgetter('age'))
|
||||
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
|
||||
|
||||
The operator module functions allow multiple levels of sorting. For example, to
|
||||
sort by *grade* then by *age*:
|
||||
|
||||
>>> sorted(student_tuples, key=itemgetter(1,2))
|
||||
[('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)]
|
||||
|
||||
>>> sorted(student_objects, key=attrgetter('grade', 'age'))
|
||||
[('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)]
|
||||
|
||||
The :func:`operator.methodcaller` function makes method calls with fixed
|
||||
parameters for each object being sorted. For example, the :meth:`str.count`
|
||||
method could be used to compute message priority by counting the
|
||||
number of exclamation marks in a message:
|
||||
|
||||
>>> from operator import methodcaller
|
||||
>>> messages = ['critical!!!', 'hurry!', 'standby', 'immediate!!']
|
||||
>>> sorted(messages, key=methodcaller('count', '!'))
|
||||
['standby', 'hurry!', 'immediate!!', 'critical!!!']
|
||||
|
||||
Ascending and Descending
|
||||
========================
|
||||
|
||||
Both :meth:`list.sort` and :func:`sorted` accept a *reverse* parameter with a
|
||||
boolean value. This is used to flag descending sorts. For example, to get the
|
||||
student data in reverse *age* order:
|
||||
|
||||
>>> sorted(student_tuples, key=itemgetter(2), reverse=True)
|
||||
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
|
||||
|
||||
>>> sorted(student_objects, key=attrgetter('age'), reverse=True)
|
||||
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
|
||||
|
||||
Sort Stability and Complex Sorts
|
||||
================================
|
||||
|
||||
Starting with Python 2.3, sorts are guaranteed to be `stable
|
||||
<https://en.wikipedia.org/wiki/Sorting_algorithm#Stability>`_\. That means that
|
||||
when multiple records have the same key, their original order is preserved.
|
||||
|
||||
>>> data = [('red', 1), ('blue', 1), ('red', 2), ('blue', 2)]
|
||||
>>> sorted(data, key=itemgetter(0))
|
||||
[('blue', 1), ('blue', 2), ('red', 1), ('red', 2)]
|
||||
|
||||
Notice how the two records for *blue* retain their original order so that
|
||||
``('blue', 1)`` is guaranteed to precede ``('blue', 2)``.
|
||||
|
||||
This wonderful property lets you build complex sorts in a series of sorting
|
||||
steps. For example, to sort the student data by descending *grade* and then
|
||||
ascending *age*, do the *age* sort first and then sort again using *grade*:
|
||||
|
||||
>>> s = sorted(student_objects, key=attrgetter('age')) # sort on secondary key
|
||||
>>> sorted(s, key=attrgetter('grade'), reverse=True) # now sort on primary key, descending
|
||||
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
|
||||
|
||||
The `Timsort <https://en.wikipedia.org/wiki/Timsort>`_ algorithm used in Python
|
||||
does multiple sorts efficiently because it can take advantage of any ordering
|
||||
already present in a dataset.
|
||||
|
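The series-of-steps idea can be wrapped in a small helper. This is only a sketch: ``multisort`` and its ``specs`` argument are hypothetical names, and the student data is repeated here so the example stands alone.

```python
from operator import itemgetter

def multisort(records, specs):
    """Sort by several keys; `specs` lists (index, reverse) pairs,
    primary key first."""
    result = list(records)
    # Apply the least significant key first; stability preserves that
    # ordering while the more significant keys are sorted afterwards.
    for index, reverse in reversed(specs):
        result.sort(key=itemgetter(index), reverse=reverse)
    return result

student_tuples = [
    ('john', 'A', 15),
    ('jane', 'B', 12),
    ('dave', 'B', 10),
]
# Descending grade (index 1), then ascending age (index 2).
by_grade_then_age = multisort(student_tuples, [(1, True), (2, False)])
```

This does the same two passes as the example above, just driven by a list of keys.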
||||
The Old Way Using Decorate-Sort-Undecorate
|
||||
==========================================
|
||||
|
||||
This idiom is called Decorate-Sort-Undecorate after its three steps:
|
||||
|
||||
* First, the initial list is decorated with new values that control the sort order.
|
||||
|
||||
* Second, the decorated list is sorted.
|
||||
|
||||
* Finally, the decorations are removed, creating a list that contains only the
|
||||
initial values in the new order.
|
||||
|
||||
For example, to sort the student data by *grade* using the DSU approach:
|
||||
|
||||
>>> decorated = [(student.grade, i, student) for i, student in enumerate(student_objects)]
|
||||
>>> decorated.sort()
|
||||
>>> [student for grade, i, student in decorated] # undecorate
|
||||
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
|
||||
|
||||
This idiom works because tuples are compared lexicographically; the first items
|
||||
are compared; if they are the same then the second items are compared, and so
|
||||
on.
|
||||
|
||||
It is not strictly necessary in all cases to include the index *i* in the
|
||||
decorated list, but including it gives two benefits:
|
||||
|
||||
* The sort is stable -- if two items have the same key, their order will be
|
||||
preserved in the sorted list.
|
||||
|
||||
* The original items do not have to be comparable because the ordering of the
|
||||
decorated tuples will be determined by at most the first two items. So for
|
||||
example the original list could contain complex numbers which cannot be sorted
|
||||
directly.
|
||||
|
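To illustrate the second benefit, here is a sketch that sorts complex numbers by magnitude. The sample values are made up; the point is that the index settles any tie before the complex numbers themselves would have to be compared.

```python
values = [3 + 4j, 1 + 0j, 0 + 1j, -2 + 0j]

# Decorate: (sort key, original position, original item).
decorated = [(abs(z), i, z) for i, z in enumerate(values)]
# Ties on abs() are settled by i, so two complex numbers are never compared.
decorated.sort()
# Undecorate: keep only the original items, now in the new order.
undecorated = [z for _, _, z in decorated]
```

Here ``1 + 0j`` and ``0 + 1j`` have the same magnitude, and they come out in their original relative order.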
||||
Another name for this idiom is
|
||||
`Schwartzian transform <https://en.wikipedia.org/wiki/Schwartzian_transform>`_\,
|
||||
after Randal L. Schwartz, who popularized it among Perl programmers.
|
||||
|
||||
For large lists and lists where the comparison information is expensive to
|
||||
calculate, and Python versions before 2.4, DSU is likely to be the fastest way
|
||||
to sort the list. For 2.4 and later, key functions provide the same
|
||||
functionality.
|
||||
|
||||
The Old Way Using the *cmp* Parameter
|
||||
=====================================
|
||||
|
||||
Many constructs given in this HOWTO assume Python 2.4 or later. Before that,
|
||||
there was no :func:`sorted` builtin and :meth:`list.sort` took no keyword
|
||||
arguments. Instead, all of the Py2.x versions supported a *cmp* parameter to
|
||||
handle user specified comparison functions.
|
||||
|
||||
In Python 3, the *cmp* parameter was removed entirely (as part of a larger effort to
|
||||
simplify and unify the language, eliminating the conflict between rich
|
||||
comparisons and the :meth:`__cmp__` magic method).
|
||||
|
||||
In Python 2, :meth:`~list.sort` allowed an optional function which can be called for doing the
|
||||
comparisons. That function should take two arguments to be compared and then
|
||||
return a negative value for less-than, return zero if they are equal, or return
|
||||
a positive value for greater-than. For example, we can do:
|
||||
|
||||
>>> def numeric_compare(x, y):
|
||||
... return x - y
|
||||
>>> sorted([5, 2, 4, 1, 3], cmp=numeric_compare) # doctest: +SKIP
|
||||
[1, 2, 3, 4, 5]
|
||||
|
||||
Or you can reverse the order of comparison with:
|
||||
|
||||
>>> def reverse_numeric(x, y):
|
||||
... return y - x
|
||||
>>> sorted([5, 2, 4, 1, 3], cmp=reverse_numeric) # doctest: +SKIP
|
||||
[5, 4, 3, 2, 1]
|
||||
|
||||
When porting code from Python 2.x to 3.x, the situation can arise when you have
|
||||
the user supplying a comparison function and you need to convert that to a key
|
||||
function. The following wrapper makes that easy to do::
|
||||
|
||||
    def cmp_to_key(mycmp):
        'Convert a cmp= function into a key= function'
        class K(object):
            def __init__(self, obj, *args):
                self.obj = obj
            def __lt__(self, other):
                return mycmp(self.obj, other.obj) < 0
            def __gt__(self, other):
                return mycmp(self.obj, other.obj) > 0
            def __eq__(self, other):
                return mycmp(self.obj, other.obj) == 0
            def __le__(self, other):
                return mycmp(self.obj, other.obj) <= 0
            def __ge__(self, other):
                return mycmp(self.obj, other.obj) >= 0
            def __ne__(self, other):
                return mycmp(self.obj, other.obj) != 0
        return K
|
||||
|
||||
To convert to a key function, just wrap the old comparison function:
|
||||
|
||||
.. testsetup::
|
||||
|
||||
from functools import cmp_to_key
|
||||
|
||||
.. doctest::
|
||||
|
||||
>>> sorted([5, 2, 4, 1, 3], key=cmp_to_key(reverse_numeric))
|
||||
[5, 4, 3, 2, 1]
|
||||
|
||||
In Python 2.7, the :func:`functools.cmp_to_key` function was added to the
|
||||
functools module.
|
||||
|
||||
Odds and Ends
=============
|
||||
|
||||
* For locale aware sorting, use :func:`locale.strxfrm` for a key function or
|
||||
:func:`locale.strcoll` for a comparison function.
|
||||
|
||||
* The *reverse* parameter still maintains sort stability (so that records with
|
||||
equal keys retain their original order). Interestingly, that effect can be
|
||||
simulated without the parameter by using the builtin :func:`reversed` function
|
||||
twice:
|
||||
|
||||
>>> data = [('red', 1), ('blue', 1), ('red', 2), ('blue', 2)]
|
||||
>>> standard_way = sorted(data, key=itemgetter(0), reverse=True)
|
||||
>>> double_reversed = list(reversed(sorted(reversed(data), key=itemgetter(0))))
|
||||
>>> assert standard_way == double_reversed
|
||||
>>> standard_way
|
||||
[('red', 1), ('red', 2), ('blue', 1), ('blue', 2)]
|
||||
|
||||
* To create a standard sort order for a class, just add the appropriate rich
|
||||
comparison methods:
|
||||
|
||||
>>> Student.__eq__ = lambda self, other: self.age == other.age
|
||||
>>> Student.__ne__ = lambda self, other: self.age != other.age
|
||||
>>> Student.__lt__ = lambda self, other: self.age < other.age
|
||||
>>> Student.__le__ = lambda self, other: self.age <= other.age
|
||||
>>> Student.__gt__ = lambda self, other: self.age > other.age
|
||||
>>> Student.__ge__ = lambda self, other: self.age >= other.age
|
||||
>>> sorted(student_objects)
|
||||
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
|
||||
|
||||
For general purpose comparisons, the recommended approach is to define all six
|
||||
rich comparison operators. The :func:`functools.total_ordering` class
|
||||
decorator makes this easy to implement.
|
||||
|
||||
* Key functions need not depend directly on the objects being sorted. A key
|
||||
function can also access external resources. For instance, if the student grades
|
||||
are stored in a dictionary, they can be used to sort a separate list of student
|
||||
names:
|
||||
|
||||
>>> students = ['dave', 'john', 'jane']
|
||||
>>> grades = {'john': 'F', 'jane':'A', 'dave': 'C'}
|
||||
>>> sorted(students, key=grades.__getitem__)
|
||||
['jane', 'dave', 'john']
|
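As a sketch of the :func:`functools.total_ordering` approach mentioned in the list above, the ``Student`` class could be written so that only two comparison methods are spelled out by hand. Ordering by age alone is an assumption carried over from the earlier example.

```python
from functools import total_ordering

@total_ordering
class Student(object):
    def __init__(self, name, grade, age):
        self.name = name
        self.grade = grade
        self.age = age

    def __repr__(self):
        return repr((self.name, self.grade, self.age))

    # total_ordering derives <=, > and >= from __eq__ and __lt__.
    def __eq__(self, other):
        return self.age == other.age

    def __lt__(self, other):
        return self.age < other.age

student_objects = [
    Student('john', 'A', 15),
    Student('jane', 'B', 12),
    Student('dave', 'B', 10),
]
ordered = sorted(student_objects)
```

The decorator fills in the remaining comparison methods, so ``sorted`` and the other comparison operators all agree with each other.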
||||
748
Doc/howto/unicode.rst
Normal file
@@ -0,0 +1,748 @@
|
||||
*****************
|
||||
Unicode HOWTO
|
||||
*****************
|
||||
|
||||
:Release: 1.03
|
||||
|
||||
This HOWTO discusses Python 2.x's support for Unicode, and explains
|
||||
various problems that people commonly encounter when trying to work
|
||||
with Unicode. For the Python 3 version, see
|
||||
<https://docs.python.org/3/howto/unicode.html>.
|
||||
|
||||
Introduction to Unicode
|
||||
=======================
|
||||
|
||||
History of Character Codes
|
||||
--------------------------
|
||||
|
||||
In 1968, the American Standard Code for Information Interchange, better known by
|
||||
its acronym ASCII, was standardized. ASCII defined numeric codes for various
|
||||
characters, with the numeric values running from 0 to
|
||||
127. For example, the lowercase letter 'a' is assigned 97 as its code
|
||||
value.
|
||||
|
||||
ASCII was an American-developed standard, so it only defined unaccented
|
||||
characters. There was an 'e', but no 'é' or 'Í'. This meant that languages
|
||||
which required accented characters couldn't be faithfully represented in ASCII.
|
||||
(Actually the missing accents matter for English, too, which contains words such
|
||||
as 'naïve' and 'café', and some publications have house styles which require
|
||||
spellings such as 'coöperate'.)
|
||||
|
||||
For a while people just wrote programs that didn't display accents. I remember
|
||||
looking at Apple ][ BASIC programs, published in French-language publications in
|
||||
the mid-1980s, that had lines like these::
|
||||
|
||||
PRINT "MISE A JOUR TERMINEE"
|
||||
PRINT "PARAMETRES ENREGISTRES"
|
||||
|
||||
Those messages should contain accents, and they just look wrong to someone who
|
||||
can read French.
|
||||
|
||||
In the 1980s, almost all personal computers were 8-bit, meaning that bytes could
|
||||
hold values ranging from 0 to 255. ASCII codes only went up to 127, so some
|
||||
machines assigned values between 128 and 255 to accented characters. Different
|
||||
machines had different codes, however, which led to problems exchanging files.
|
||||
Eventually various commonly used sets of values for the 128--255 range emerged.
|
||||
Some were true standards, defined by the International Organization for
|
||||
Standardization, and some were *de facto* conventions that were invented by one
|
||||
company or another and managed to catch on.
|
||||
|
||||
255 characters aren't very many. For example, you can't fit both the accented
|
||||
characters used in Western Europe and the Cyrillic alphabet used for Russian
|
||||
into the 128--255 range because there are more than 128 such characters.
|
||||
|
||||
You could write files using different codes (all your Russian files in a coding
|
||||
system called KOI8, all your French files in a different coding system called
|
||||
Latin1), but what if you wanted to write a French document that quotes some
|
||||
Russian text? In the 1980s people began to want to solve this problem, and the
|
||||
Unicode standardization effort began.
|
||||
|
||||
Unicode started out using 16-bit characters instead of 8-bit characters. 16
|
||||
bits means you have 2^16 = 65,536 distinct values available, making it possible
|
||||
to represent many different characters from many different alphabets; an initial
|
||||
goal was to have Unicode contain the alphabets for every single human language.
|
||||
It turns out that even 16 bits isn't enough to meet that goal, and the modern
|
||||
Unicode specification uses a wider range of codes, 0--1,114,111 (0x10ffff in
|
||||
base-16).
|
||||
|
||||
There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
|
||||
originally separate efforts, but the specifications were merged with the 1.1
|
||||
revision of Unicode.
|
||||
|
||||
(This discussion of Unicode's history is highly simplified. I don't think the
|
||||
average Python programmer needs to worry about the historical details; consult
|
||||
the Unicode consortium site listed in the References for more information.)
|
||||
|
||||
|
||||
Definitions
|
||||
-----------
|
||||
|
||||
A **character** is the smallest possible component of a text. 'A', 'B', 'C',
|
||||
etc., are all different characters. So are 'È' and 'Í'. Characters are
|
||||
abstractions, and vary depending on the language or context you're talking
|
||||
about. For example, the symbol for ohms (Ω) is usually drawn much like the
|
||||
capital letter omega (Ω) in the Greek alphabet (they may even be the same in
|
||||
some fonts), but these are two different characters that have different
|
||||
meanings.
|
||||
|
||||
The Unicode standard describes how characters are represented by **code
|
||||
points**. A code point is an integer value, usually denoted in base 16. In the
|
||||
standard, a code point is written using the notation U+12ca to mean the
|
||||
character with value 0x12ca (4810 decimal). The Unicode standard contains a lot
|
||||
of tables listing characters and their corresponding code points::
|
||||
|
||||
0061 'a'; LATIN SMALL LETTER A
|
||||
0062 'b'; LATIN SMALL LETTER B
|
||||
0063 'c'; LATIN SMALL LETTER C
|
||||
...
|
||||
007B '{'; LEFT CURLY BRACKET
|
||||
|
||||
Strictly, these definitions imply that it's meaningless to say 'this is
|
||||
character U+12ca'. U+12ca is a code point, which represents some particular
|
||||
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. In
|
||||
informal contexts, this distinction between code points and characters will
|
||||
sometimes be forgotten.
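
The relationship between a code point and the character name assigned to it can
be checked from Python with the standard :mod:`unicodedata` module. The
following is only a small sketch: the helper name ``describe`` is invented for
this example, and the code is written so that it should run unchanged on
Python 2.7 and 3.x.

```python
import unicodedata

def describe(ch):
    """Return 'U+XXXX NAME' for a one-character Unicode string."""
    return 'U+%04X %s' % (ord(ch), unicodedata.name(ch))

print(describe(u'a'))        # U+0061 LATIN SMALL LETTER A
print(describe(u'{'))        # U+007B LEFT CURLY BRACKET
print(describe(u'\u12ca'))   # U+12CA ETHIOPIC SYLLABLE WI
```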
|
||||
|
||||
A character is represented on a screen or on paper by a set of graphical
|
||||
elements that's called a **glyph**. The glyph for an uppercase A, for example,
|
||||
is two diagonal strokes and a horizontal stroke, though the exact details will
|
||||
depend on the font being used. Most Python code doesn't need to worry about
|
||||
glyphs; figuring out the correct glyph to display is generally the job of a GUI
|
||||
toolkit or a terminal's font renderer.
|
||||
|
||||
|
||||
Encodings
|
||||
---------
|
||||
|
||||
To summarize the previous section: a Unicode string is a sequence of code
|
||||
points, which are numbers from 0 to 0x10ffff. This sequence needs to be
|
||||
represented as a sequence of bytes (meaning, values from 0--255) in memory. The rules
|
||||
for translating a Unicode string into a sequence of bytes are called an
|
||||
**encoding**.
|
||||
|
||||
The first encoding you might think of is an array of 32-bit integers. In this
|
||||
representation, the string "Python" would look like this::
|
||||
|
||||
      P           y           t           h           o           n
   0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
|
||||
|
||||
This representation is straightforward but using it presents a number of
|
||||
problems.
|
||||
|
||||
1. It's not portable; different processors order the bytes differently.
|
||||
|
||||
2. It's very wasteful of space. In most texts, the majority of the code points
|
||||
are less than 128, or less than 256, so a lot of space is occupied by zero
|
||||
bytes. The above string takes 24 bytes compared to the 6 bytes needed for an
|
||||
ASCII representation. Increased RAM usage doesn't matter too much (desktop
|
||||
computers have megabytes of RAM, and strings aren't usually that large), but
|
||||
expanding our usage of disk and network bandwidth by a factor of 4 is
|
||||
intolerable.
|
||||
|
||||
3. It's not compatible with existing C functions such as ``strlen()``, so a new
|
||||
family of wide string functions would need to be used.
|
||||
|
||||
4. Many Internet standards are defined in terms of textual data, and can't
|
||||
handle content with embedded zero bytes.
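
The first two problems can be observed directly. The sketch below assumes the
``'utf-32-le'`` and ``'utf-32-be'`` codecs that ship with Python 2.6 and later
as stand-ins for "an array of 32-bit integers"; it is an illustration, not a
recommendation to use those encodings.

```python
text = u'Python'

little = bytearray(text.encode('utf-32-le'))   # least significant byte first
big = bytearray(text.encode('utf-32-be'))      # most significant byte first

print(len(little))                        # 24 bytes for 6 code points
print(['%02x' % b for b in little[:4]])   # ['50', '00', '00', '00']
print(['%02x' % b for b in big[:4]])      # ['00', '00', '00', '50']

# Count how much of the representation is padding.
zero_bytes = sum(1 for b in little if b == 0)
print(zero_bytes)                         # 18 of the 24 bytes are zero
```

The same six code points produce two different byte sequences depending on the
byte order, and three quarters of the bytes carry no information.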
|
||||
|
||||
Generally people don't use this encoding, instead choosing other
|
||||
encodings that are more efficient and convenient. UTF-8 is probably
|
||||
the most commonly supported encoding; it will be discussed below.
|
||||
|
||||
Encodings don't have to handle every possible Unicode character, and most
|
||||
encodings don't. For example, Python's default encoding is the 'ascii'
|
||||
encoding. The rules for converting a Unicode string into the ASCII encoding are
|
||||
simple; for each code point:
|
||||
|
||||
1. If the code point is < 128, each byte is the same as the value of the code
|
||||
point.
|
||||
|
||||
2. If the code point is 128 or greater, the Unicode string can't be represented
|
||||
in this encoding. (Python raises a :exc:`UnicodeEncodeError` exception in this
|
||||
case.)
|
||||
|
||||
Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points
|
||||
0--255 are identical to the Latin-1 values, so converting to this encoding simply
|
||||
requires converting code points to byte values; if a code point larger than 255
|
||||
is encountered, the string can't be encoded into Latin-1.
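
The ASCII and Latin-1 rules can be tried out with the ``.encode()`` method.
This is a minimal sketch using only built-in codecs; the sample strings are
arbitrary.

```python
text = u'caf\xe9'                 # the last character is code point 233 (0xE9)

# Latin-1: every code point below 256 becomes the byte with the same value.
latin1 = text.encode('latin-1')
print(list(bytearray(latin1)))    # [99, 97, 102, 233]

# ASCII: code point 233 is not below 128, so encoding fails.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)                   # False

# Latin-1 fails as soon as a code point above 255 appears.
try:
    u'\u20ac'.encode('latin-1')   # EURO SIGN, code point 8364
    euro_ok = True
except UnicodeEncodeError:
    euro_ok = False
print(euro_ok)                    # False
```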
|
||||
|
||||
Encodings don't have to be simple one-to-one mappings like Latin-1. Consider
|
||||
IBM's EBCDIC, which was used on IBM mainframes. Letter values weren't in one
|
||||
block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145
|
||||
through 153. If you wanted to use EBCDIC as an encoding, you'd probably use
|
||||
some sort of lookup table to perform the conversion, but this is largely an
|
||||
internal detail.
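
Python ships codecs for several EBCDIC code pages, so the gap between 'i' and
'j' can be seen without writing a lookup table by hand. The sketch below
assumes the ``'cp037'`` codec, which is one EBCDIC variant among several; other
variants may place other characters differently.

```python
letters = u'abcdefghijklmnopqr'
values = list(bytearray(letters.encode('cp037')))

print(values[:9])   # 'a' through 'i': 129 to 137
print(values[9:])   # 'j' through 'r': 145 to 153
```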
|
||||
|
||||
UTF-8 is one of the most commonly used encodings. UTF stands for "Unicode
|
||||
Transformation Format", and the '8' means that 8-bit numbers are used in the
|
||||
encoding. (There's also a UTF-16 encoding, but it's less frequently used than
|
||||
UTF-8.) UTF-8 uses the following rules:
|
||||
|
||||
1. If the code point is <128, it's represented by the corresponding byte value.
|
||||
2. If the code point is between 128 and 0x7ff, it's turned into two byte values
|
||||
between 128 and 255.
|
||||
3. Code points >0x7ff are turned into three- or four-byte sequences, where each
|
||||
byte of the sequence is between 128 and 255.
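
The three rules can be checked by encoding one code point from each range. The
sample characters below are arbitrary choices made for this sketch.

```python
samples = [
    u'A',            # U+0041, below 128
    u'\xe9',         # U+00E9, between 128 and 0x7ff
    u'\u20ac',       # U+20AC, above 0x7ff
    u'\U00010000',   # U+10000, beyond the 16-bit range
]

lengths = [len(ch.encode('utf-8')) for ch in samples]
print(lengths)       # [1, 2, 3, 4]

# Every byte of a multi-byte sequence is between 128 and 255.
high_only = all(
    b >= 128
    for ch in samples[1:]
    for b in bytearray(ch.encode('utf-8'))
)
print(high_only)     # True
```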
|
||||
|
||||
UTF-8 has several convenient properties:
|
||||
|
||||
1. It can handle any Unicode code point.
|
||||
2. A Unicode string is turned into a string of bytes that contains zero bytes
   only where they represent the null character (U+0000). This means UTF-8
   strings can be processed by C functions such as ``strcpy()`` and sent
   through protocols that can't handle zero bytes. Because the encoding is
   defined byte by byte, there are also no byte-ordering issues.
|
||||
3. A string of ASCII text is also valid UTF-8 text.
|
||||
4. UTF-8 is fairly compact; the majority of commonly used characters can be
   represented with one or two bytes, and values less than 128 occupy only a
   single byte.
|
||||
5. If bytes are corrupted or lost, it's possible to determine the start of the
|
||||
next UTF-8-encoded code point and resynchronize. It's also unlikely that
|
||||
random 8-bit data will look like valid UTF-8.
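
The resynchronization property follows from the bit patterns: every byte that
continues a multi-byte sequence has the form ``10xxxxxx``, and no byte that
starts a code point does. A rough sketch of skipping to the next code point
after data loss, with ``next_start`` as a made-up helper name:

```python
def next_start(data, pos):
    """Return the index of the first byte at or after pos that begins a code point."""
    data = bytearray(data)
    # Continuation bytes have the bit pattern 10xxxxxx; skip over them.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

encoded = u'a\u20acb'.encode('utf-8')   # bytes: 61 e2 82 ac 62
damaged = encoded[2:]                   # begins in the middle of the euro sign

start = next_start(damaged, 0)
recovered = damaged[start:].decode('utf-8')
print(start)             # 2
print(repr(recovered))
```

Only the damaged code point is lost; everything after it decodes normally.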
|
||||
|
||||
|
||||
|
||||
References
|
||||
----------
|
||||
|
||||
The Unicode Consortium site at <http://www.unicode.org> has character charts, a
|
||||
glossary, and PDF versions of the Unicode specification. Be prepared for some
|
||||
difficult reading. <http://www.unicode.org/history/> is a chronology of the
|
||||
origin and development of Unicode.
|
||||
|
||||
To help understand the standard, Jukka Korpela has written an introductory guide
|
||||
to reading the Unicode character tables, available at
|
||||
<https://www.cs.tut.fi/~jkorpela/unicode/guide.html>.
|
||||
|
||||
Another good introductory article was written by Joel Spolsky
|
||||
<http://www.joelonsoftware.com/articles/Unicode.html>.
|
||||
If this introduction didn't make things clear to you, you should try reading this
|
||||
alternate article before continuing.
|
||||
|
||||
.. Jason Orendorff XXX http://www.jorendorff.com/articles/unicode/ is broken
|
||||
|
||||
Wikipedia entries are often helpful; see the entries for "character encoding"
|
||||
<http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
|
||||
<http://en.wikipedia.org/wiki/UTF-8>, for example.
|
||||
|
||||
|
||||
Python 2.x's Unicode Support
|
||||
============================
|
||||
|
||||
Now that you've learned the rudiments of Unicode, we can look at Python's
|
||||
Unicode features.
|
||||
|
||||
|
||||
The Unicode Type
|
||||
----------------
|
||||
|
||||
Unicode strings are expressed as instances of the :class:`unicode` type, one of
|
||||
Python's repertoire of built-in types. It derives from an abstract type called
|
||||
:class:`basestring`, which is also an ancestor of the :class:`str` type; you can
|
||||
therefore check if a value is a string type with ``isinstance(value,
|
||||
basestring)``. Under the hood, Python represents Unicode strings as either 16-
|
||||
or 32-bit integers, depending on how the Python interpreter was compiled.
|
||||
|
||||
The :func:`unicode` constructor has the signature ``unicode(string[, encoding,
|
||||
errors])``. All of its arguments should be 8-bit strings. The first argument
|
||||
is converted to Unicode using the specified encoding; if you leave off the
|
||||
``encoding`` argument, the ASCII encoding is used for the conversion, so
|
||||
bytes with values greater than 127 will be treated as errors::
|
||||
|
||||
>>> unicode('abcdef')
|
||||
u'abcdef'
|
||||
>>> s = unicode('abcdef')
|
||||
>>> type(s)
|
||||
<type 'unicode'>
|
||||
>>> unicode('abcdef' + chr(255)) #doctest: +NORMALIZE_WHITESPACE
|
||||
Traceback (most recent call last):
|
||||
...
|
||||
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
|
||||
ordinal not in range(128)
|
||||
|
||||
The ``errors`` argument specifies the response when the input string can't be
|
||||
converted according to the encoding's rules. Legal values for this argument are
|
||||
'strict' (raise a ``UnicodeDecodeError`` exception), 'replace' (add U+FFFD,
|
||||
'REPLACEMENT CHARACTER'), or 'ignore' (just leave the character out of the
|
||||
Unicode result). The following examples show the differences::
|
||||
|
||||
>>> unicode('\x80abc', errors='strict') #doctest: +NORMALIZE_WHITESPACE
|
||||
Traceback (most recent call last):
|
||||
...
|
||||
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
|
||||
ordinal not in range(128)
|
||||
>>> unicode('\x80abc', errors='replace')
|
||||
u'\ufffdabc'
|
||||
>>> unicode('\x80abc', errors='ignore')
|
||||
u'abc'
|
||||
|
||||
Encodings are specified as strings containing the encoding's name. Python 2.7
|
||||
comes with roughly 100 different encodings; see the Python Library Reference at
|
||||
:ref:`standard-encodings` for a list. Some encodings
|
||||
have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all
|
||||
synonyms for the same encoding.
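
One way to confirm that several names refer to the same encoding is to ask the
:mod:`codecs` module which codec each name resolves to. This is a small sketch;
the canonical name that is printed may differ between Python versions, so the
example only compares the results with each other.

```python
import codecs

names = ['latin-1', 'iso_8859_1', '8859']
canonical = [codecs.lookup(name).name for name in names]

print(canonical)
print(len(set(canonical)) == 1)   # True: all three resolve to one codec
```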
|
||||
|
||||
One-character Unicode strings can also be created with the :func:`unichr`
|
||||
built-in function, which takes integers and returns a Unicode string of length 1
|
||||
that contains the corresponding code point. The reverse operation is the
|
||||
built-in :func:`ord` function that takes a one-character Unicode string and
|
||||
returns the code point value::
|
||||
|
||||
>>> unichr(40960)
|
||||
u'\ua000'
|
||||
>>> ord(u'\ua000')
|
||||
40960
|
||||
|
||||
Instances of the :class:`unicode` type have many of the same methods as the
|
||||
8-bit string type for operations such as searching and formatting::
|
||||
|
||||
>>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
|
||||
>>> s.count('e')
|
||||
5
|
||||
>>> s.find('feather')
|
||||
9
|
||||
>>> s.find('bird')
|
||||
-1
|
||||
>>> s.replace('feather', 'sand')
|
||||
u'Was ever sand so lightly blown to and fro as this multitude?'
|
||||
>>> s.upper()
|
||||
u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'
|
||||
|
||||
Note that the arguments to these methods can be Unicode strings or 8-bit
|
||||
strings. 8-bit strings will be converted to Unicode before carrying out the
|
||||
operation; Python's default ASCII encoding will be used, so characters greater
|
||||
than 127 will cause an exception::
|
||||
|
||||
>>> s.find('Was\x9f') #doctest: +NORMALIZE_WHITESPACE
|
||||
Traceback (most recent call last):
|
||||
...
|
||||
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3:
|
||||
ordinal not in range(128)
|
||||
>>> s.find(u'Was\x9f')
|
||||
-1
|
||||
|
||||
Much Python code that operates on strings will therefore work with Unicode
|
||||
strings without requiring any changes to the code. (Input and output code needs
|
||||
more updating for Unicode; more on this later.)
|
||||
|
||||
Another important method is ``.encode([encoding], [errors='strict'])``, which
|
||||
returns an 8-bit string version of the Unicode string, encoded in the requested
|
||||
encoding. The ``errors`` parameter is the same as the parameter of the
|
||||
``unicode()`` constructor, with one additional possibility; as well as 'strict',
|
||||
'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which uses XML's
|
||||
character references. The following example shows the different results::
|
||||
|
||||
>>> u = unichr(40960) + u'abcd' + unichr(1972)
|
||||
>>> u.encode('utf-8')
|
||||
'\xea\x80\x80abcd\xde\xb4'
|
||||
>>> u.encode('ascii') #doctest: +NORMALIZE_WHITESPACE
|
||||
Traceback (most recent call last):
|
||||
...
|
||||
UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in
|
||||
position 0: ordinal not in range(128)
|
||||
>>> u.encode('ascii', 'ignore')
|
||||
'abcd'
|
||||
>>> u.encode('ascii', 'replace')
|
||||
'?abcd?'
|
||||
>>> u.encode('ascii', 'xmlcharrefreplace')
|
||||
'&#40960;abcd&#1972;'
|
||||
|
||||
Python's 8-bit strings have a ``.decode([encoding], [errors])`` method that
|
||||
interprets the string using the given encoding::
|
||||
|
||||
>>> u = unichr(40960) + u'abcd' + unichr(1972) # Assemble a string
|
||||
>>> utf8_version = u.encode('utf-8') # Encode as UTF-8
|
||||
>>> type(utf8_version), utf8_version
|
||||
(<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
|
||||
>>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8
|
||||
>>> u == u2 # The two strings match
|
||||
True
|
||||
|
||||
The low-level routines for registering and accessing the available encodings are
|
||||
found in the :mod:`codecs` module. However, the encoding and decoding functions
|
||||
returned by this module are usually more low-level than is comfortable, so I'm
|
||||
not going to describe the :mod:`codecs` module here. If you need to implement a
|
||||
completely new encoding, you'll need to learn about the :mod:`codecs` module
|
||||
interfaces, but implementing encodings is a specialized task that also won't be
|
||||
covered here. Consult the Python documentation to learn more about this module.
|
||||
|
||||
The most commonly used part of the :mod:`codecs` module is the
|
||||
:func:`codecs.open` function which will be discussed in the section on input and
|
||||
output.
|
||||
|
||||
|
||||
Unicode Literals in Python Source Code
|
||||
--------------------------------------
|
||||
|
||||
In Python source code, Unicode literals are written as strings prefixed with the
|
||||
'u' or 'U' character: ``u'abcdefghijk'``. Specific code points can be written
|
||||
using the ``\u`` escape sequence, which is followed by four hex digits giving
|
||||
the code point. The ``\U`` escape sequence is similar, but expects 8 hex
|
||||
digits, not 4.
|
||||
|
||||
Unicode literals can also use the same escape sequences as 8-bit strings,
|
||||
including ``\x``, but ``\x`` only takes two hex digits so it can't express an
|
||||
arbitrary code point. Octal escapes can go up to U+01ff, which is octal 777.
|
||||
|
||||
::
|
||||
|
||||
   >>> s = u"a\xac\u1234\u20ac\U00008000"
   ... #      ^^^^ two-digit hex escape
   ... #          ^^^^^^ four-digit Unicode escape
   ... #                      ^^^^^^^^^^ eight-digit Unicode escape
   >>> for c in s:  print ord(c),
   ...
   97 172 4660 8364 32768
|
||||
|
||||
Using escape sequences for code points greater than 127 is fine in small doses,
|
||||
but becomes an annoyance if you're using many accented characters, as you would
|
||||
in a program with messages in French or some other accent-using language. You
|
||||
can also assemble strings using the :func:`unichr` built-in function, but this is
|
||||
even more tedious.
|
||||
|
||||
Ideally, you'd want to be able to write literals in your language's natural
|
||||
encoding. You could then edit Python source code with your favorite editor
|
||||
which would display the accented characters naturally, and have the right
|
||||
characters used at runtime.
|
||||
|
||||
Python supports writing Unicode literals in any encoding, but you have to
|
||||
declare the encoding being used. This is done by including a special comment as
|
||||
either the first or second line of the source file::
|
||||
|
||||
#!/usr/bin/env python
|
||||
# -*- coding: latin-1 -*-
|
||||
|
||||
u = u'abcdé'
|
||||
print ord(u[-1])
|
||||
|
||||
The syntax is inspired by Emacs's notation for specifying variables local to a
|
||||
file. Emacs supports many different variables, but Python only supports
|
||||
'coding'. The ``-*-`` symbols indicate to Emacs that the comment is special;
|
||||
they have no significance to Python but are a convention. Python looks for
|
||||
``coding: name`` or ``coding=name`` in the comment.
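
To make the matching rule concrete, here is a rough sketch of how such a
comment could be recognized. The regular expression is adapted from the one
given in PEP 263, and ``declared_encoding`` is a hypothetical helper written
for this example; the interpreter's own check also restricts the search to the
first two lines of the file, which this sketch does not do.

```python
import re

# Adapted from the pattern in PEP 263; not the interpreter's actual code.
CODING_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')

def declared_encoding(line):
    """Return the encoding named in a coding comment, or None."""
    match = CODING_RE.match(line)
    return match.group(1) if match else None

print(declared_encoding('# -*- coding: latin-1 -*-'))         # latin-1
print(declared_encoding('# vim: set fileencoding=utf-8 :'))   # utf-8
print(declared_encoding('import os'))                         # None
```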
|
||||
|
||||
If you don't include such a comment, the default encoding used will be ASCII.
|
||||
Versions of Python before 2.4 were Euro-centric and assumed Latin-1 as a default
|
||||
encoding for string literals; in Python 2.4, characters greater than 127 still
|
||||
work but result in a warning. For example, the following program has no
|
||||
encoding declaration::
|
||||
|
||||
#!/usr/bin/env python
|
||||
u = u'abcdé'
|
||||
print ord(u[-1])
|
||||
|
||||
When you run it with Python 2.4, it will output the following warning::
|
||||
|
||||
amk:~$ python2.4 p263.py
|
||||
sys:1: DeprecationWarning: Non-ASCII character '\xe9'
|
||||
in file p263.py on line 2, but no encoding declared;
|
||||
see https://www.python.org/peps/pep-0263.html for details
|
||||
|
||||
Python 2.5 and higher are stricter and will produce a syntax error::
|
||||
|
||||
amk:~$ python2.5 p263.py
|
||||
File "/tmp/p263.py", line 2
|
||||
SyntaxError: Non-ASCII character '\xc3' in file /tmp/p263.py
|
||||
on line 2, but no encoding declared; see
|
||||
https://www.python.org/peps/pep-0263.html for details
|
||||
|
||||
|
||||
Unicode Properties
|
||||
------------------
|
||||
|
||||
The Unicode specification includes a database of information about code points.
|
||||
For each code point that's defined, the information includes the character's
|
||||
name, its category, the numeric value if applicable (Unicode has characters
|
||||
representing the Roman numerals and fractions such as one-third and
|
||||
four-fifths). There are also properties related to the code point's use in
|
||||
bidirectional text and other display-related properties.
|
||||
|
||||
The following program displays some information about several characters, and
|
||||
prints the numeric value of one particular character::
|
||||
|
||||
   import unicodedata

   u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)

   for i, c in enumerate(u):
       print i, '%04x' % ord(c), unicodedata.category(c),
       print unicodedata.name(c)

   # Get numeric value of second character
   print unicodedata.numeric(u[1])
|
||||
|
||||
When run, this prints::
|
||||
|
||||
0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
|
||||
1 0bf2 No TAMIL NUMBER ONE THOUSAND
|
||||
2 0f84 Mn TIBETAN MARK HALANTA
|
||||
3 1770 Lo TAGBANWA LETTER SA
|
||||
4 33af So SQUARE RAD OVER S SQUARED
|
||||
1000.0
|
||||
|
||||
The category codes are abbreviations describing the nature of the character.
|
||||
These are grouped into categories such as "Letter", "Number", "Punctuation", or
|
||||
"Symbol", which in turn are broken up into subcategories. To take the codes
|
||||
from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means
|
||||
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
|
||||
other". See
|
||||
<http://www.unicode.org/reports/tr44/#General_Category_Values> for a
|
||||
list of category codes.
|
||||
|
||||
References
|
||||
----------
|
||||
|
||||
The Unicode and 8-bit string types are described in the Python library reference
|
||||
at :ref:`typesseq`.
|
||||
|
||||
The documentation for the :mod:`unicodedata` module.
|
||||
|
||||
The documentation for the :mod:`codecs` module.
|
||||
|
||||
Marc-André Lemburg gave a presentation at EuroPython 2002 titled "Python and
|
||||
Unicode". A PDF version of his slides is available at
|
||||
<https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>, and is an
|
||||
excellent overview of the design of Python's Unicode features.
|
||||
|
||||
|
||||
Reading and Writing Unicode Data
|
||||
================================
|
||||
|
||||
Once you've written some code that works with Unicode data, the next problem is
|
||||
input/output. How do you get Unicode strings into your program, and how do you
|
||||
convert Unicode into a form suitable for storage or transmission?
|
||||
|
||||
It's possible that you may not need to do anything depending on your input
|
||||
sources and output destinations; you should check whether the libraries used in
|
||||
your application support Unicode natively. XML parsers often return Unicode
|
||||
data, for example. Many relational databases also support Unicode-valued
|
||||
columns and can return Unicode values from an SQL query.
|
||||
|
||||
Unicode data is usually converted to a particular encoding before it gets
|
||||
written to disk or sent over a socket. It's possible to do all the work
|
||||
yourself: open a file, read an 8-bit string from it, and convert the string with
|
||||
``unicode(str, encoding)``. However, the manual approach is not recommended.
|
||||
|
||||
One problem is the multi-byte nature of encodings; one Unicode character can be
|
||||
represented by several bytes. If you want to read the file in arbitrary-sized
|
||||
chunks (say, 1K or 4K), you need to write error-handling code to catch the case
|
||||
where only part of the bytes encoding a single Unicode character are read at the
|
||||
end of a chunk. One solution would be to read the entire file into memory and
|
||||
then perform the decoding, but that prevents you from working with files that
|
||||
are extremely large; if you need to read a 2 GB file, you need 2 GB of RAM.
|
||||
(More, really, since for at least a moment you'd need to have both the encoded
|
||||
string and its Unicode version in memory.)
|
||||
|
||||
The solution would be to use the low-level decoding interface to catch the case
|
||||
of partial coding sequences. The work of implementing this has already been
|
||||
done for you: the :mod:`codecs` module includes a version of the :func:`open`
|
||||
function that returns a file-like object that assumes the file's contents are in
|
||||
a specified encoding and accepts Unicode parameters for methods such as
|
||||
``.read()`` and ``.write()``.
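
For readers curious about what that low-level interface looks like, the sketch
below feeds a UTF-8 byte string to an incremental decoder in 4-byte chunks, so
that two characters are split across chunk boundaries. It assumes
:func:`codecs.getincrementaldecoder`, available since Python 2.5; the chunk
size and sample text are arbitrary.

```python
import codecs

original = u'caf\xe9 \u20ac5'
encoded = original.encode('utf-8')   # 10 bytes

# Split into 4-byte chunks; both multi-byte characters straddle a boundary.
chunks = [encoded[i:i + 4] for i in range(0, len(encoded), 4)]

decoder = codecs.getincrementaldecoder('utf-8')()
pieces = [decoder.decode(chunk) for chunk in chunks]
# Signal end of input; this raises an error if a sequence was left incomplete.
pieces.append(decoder.decode(b'', final=True))

result = u''.join(pieces)
print(result == original)            # True
```

The decoder holds back the incomplete bytes at the end of each chunk and
completes the character when the next chunk arrives, which is the bookkeeping
that ``codecs.open()`` does on your behalf.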
|
||||
|
||||
The function's parameters are ``open(filename, mode='rb', encoding=None,
|
||||
errors='strict', buffering=1)``. ``mode`` can be ``'r'``, ``'w'``, or ``'a'``,
|
||||
just like the corresponding parameter to the regular built-in ``open()``
|
||||
function; add a ``'+'`` to update the file. ``buffering`` is similarly parallel
|
||||
to the standard function's parameter. ``encoding`` is a string giving the
|
||||
encoding to use; if it's left as ``None``, a regular Python file object that
|
||||
accepts 8-bit strings is returned. Otherwise, a wrapper object is returned, and
|
||||
data written to or read from the wrapper object will be converted as needed.
|
||||
``errors`` specifies the action for encoding errors and can be one of the usual
|
||||
values of 'strict', 'ignore', and 'replace'.
|
||||
|
||||
Reading Unicode from a file is therefore simple::
|
||||
|
||||
   import codecs
   f = codecs.open('unicode.rst', encoding='utf-8')
   for line in f:
       print repr(line)
|
||||
|
||||
It's also possible to open files in update mode, allowing both reading and
|
||||
writing::
|
||||
|
||||
f = codecs.open('test', encoding='utf-8', mode='w+')
|
||||
f.write(u'\u4500 blah blah blah\n')
|
||||
f.seek(0)
|
||||
print repr(f.readline()[:1])
|
||||
f.close()
|
||||
|
||||
Unicode character U+FEFF is used as a byte-order mark (BOM), and is often
|
||||
written as the first character of a file in order to assist with autodetection
|
||||
of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
|
||||
present at the start of a file; when such an encoding is used, the BOM will be
|
||||
automatically written as the first character and will be silently dropped when
|
||||
the file is read. There are variants of these encodings, such as 'utf-16-le'
|
||||
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
|
||||
particular byte ordering and don't skip the BOM.
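
The difference between the BOM-aware and the fixed-order variants shows up when
decoding the same bytes with each. A minimal sketch, using the
``codecs.BOM_UTF16_LE`` constant and an arbitrary two-letter string:

```python
import codecs

# A little-endian BOM followed by little-endian UTF-16 data.
with_bom = codecs.BOM_UTF16_LE + u'hi'.encode('utf-16-le')

generic = with_bom.decode('utf-16')      # BOM is consumed to pick the byte order
fixed = with_bom.decode('utf-16-le')     # BOM is kept as the character U+FEFF

print(repr(generic))
print(repr(fixed))

# Encoding with plain 'utf-16' writes a BOM in the platform's byte order.
starts_with_bom = u'hi'.encode('utf-16')[:2] in (
    codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(starts_with_bom)                   # True
```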
|
||||
|
||||
|
||||
Unicode filenames
|
||||
-----------------
|
||||
|
||||
Most of the operating systems in common use today support filenames that contain
|
||||
arbitrary Unicode characters. Usually this is implemented by converting the
|
||||
Unicode string into some encoding that varies depending on the system. For
|
||||
example, Mac OS X uses UTF-8 while Windows uses a configurable encoding; on
|
||||
Windows, Python uses the name "mbcs" to refer to whatever the currently
|
||||
configured encoding is. On Unix systems, there will only be a filesystem
|
||||
encoding if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
|
||||
you haven't, the default encoding is ASCII.
|
||||
|
||||
The :func:`sys.getfilesystemencoding` function returns the encoding to use on
|
||||
your current system, in case you want to do the encoding manually, but there's
|
||||
not much reason to bother. When opening a file for reading or writing, you can
|
||||
usually just provide the Unicode string as the filename, and it will be
|
||||
automatically converted to the right encoding for you::
|
||||
|
||||
filename = u'filename\u4500abc'
|
||||
f = open(filename, 'w')
|
||||
f.write('blah\n')
|
||||
f.close()
|
||||
|
||||
Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
|
||||
filenames.
|
||||
|
||||
:func:`os.listdir`, which returns filenames, raises an issue: should it return
|
||||
the Unicode version of filenames, or should it return 8-bit strings containing
|
||||
the encoded versions? :func:`os.listdir` will do both, depending on whether you
|
||||
provided the directory path as an 8-bit string or a Unicode string. If you pass
|
||||
a Unicode string as the path, filenames will be decoded using the filesystem's
|
||||
encoding and a list of Unicode strings will be returned, while passing an 8-bit
|
||||
path will return the 8-bit versions of the filenames. For example, assuming the
|
||||
default filesystem encoding is UTF-8, running the following program::
|
||||
|
||||
fn = u'filename\u4500abc'
|
||||
f = open(fn, 'w')
|
||||
f.close()
|
||||
|
||||
import os
|
||||
print os.listdir('.')
|
||||
print os.listdir(u'.')
|
||||
|
||||
will produce the following output:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
amk:~$ python t.py
|
||||
['.svn', 'filename\xe4\x94\x80abc', ...]
|
||||
[u'.svn', u'filename\u4500abc', ...]
|
||||
|
||||
The first list contains UTF-8-encoded filenames, and the second list contains
|
||||
the Unicode versions.
|
||||
|
||||
|
||||
|
||||
Tips for Writing Unicode-aware Programs
|
||||
---------------------------------------
|
||||
|
||||
This section provides some suggestions on writing software that deals with
|
||||
Unicode.
|
||||
|
||||
The most important tip is:
|
||||
|
||||
Software should only work with Unicode strings internally, converting to a
|
||||
particular encoding on output.
|
||||
|
||||
If you attempt to write processing functions that accept both Unicode and 8-bit
|
||||
strings, you will find your program vulnerable to bugs wherever you combine the
|
||||
two different kinds of strings. Python's default encoding is ASCII, so whenever
|
||||
a byte with a value > 127 is in the input data, you'll get a
:exc:`UnicodeDecodeError` because that byte can't be handled by the ASCII
|
||||
encoding.
|
||||
|
||||
It's easy to miss such problems if you only test your software with data that
|
||||
doesn't contain any accents; everything will seem to work, but there's actually
|
||||
a bug in your program waiting for the first user who attempts to use characters
|
||||
> 127. A second tip, therefore, is:
|
||||
|
||||
Include characters > 127 and, even better, characters > 255 in your test
|
||||
data.
|
||||
|
||||
When using data coming from a web browser or some other untrusted source, a
|
||||
common technique is to check for illegal characters in a string before using the
|
||||
string in a generated command line or storing it in a database. If you're doing
|
||||
this, be careful to check the string once it's in the form that will be used or
|
||||
stored; it's possible for encodings to be used to disguise characters. This is
|
||||
especially true if the input data also specifies the encoding; many encodings
|
||||
leave the commonly checked-for characters alone, but Python includes some
|
||||
encodings such as ``'base64'`` that modify every single character.
|
||||
|
||||
For example, let's say you have a content management system that takes a Unicode
|
||||
filename, and you want to disallow paths with a '/' character. You might write
|
||||
this code::
|
||||
|
||||
   def read_file(filename, encoding):
       if '/' in filename:
           raise ValueError("'/' not allowed in filenames")
       unicode_name = filename.decode(encoding)
       f = open(unicode_name, 'r')
       # ... return contents of file ...
|
||||
|
||||
However, if an attacker could specify the ``'base64'`` encoding, they could pass
|
||||
``'L2V0Yy9wYXNzd2Q='``, which is the base-64 encoded form of the string
|
||||
``'/etc/passwd'``, to read a system file. The above code looks for ``'/'``
|
||||
characters in the encoded form and misses the dangerous character in the
|
||||
resulting decoded form.
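
One possible repair, sketched below, is to decode first and run the check on
the decoded name. ``checked_name`` is a hypothetical helper written for this
example, not part of any library, and the :mod:`base64` module is used here
only to show what the disguised string hides. Restricting the set of encodings
a caller may request would be a reasonable additional safeguard.

```python
import base64

def checked_name(raw_name, encoding):
    """Decode first, then validate the form that will actually be used."""
    name = raw_name.decode(encoding)
    if u'/' in name:
        raise ValueError("'/' not allowed in filenames")
    return name

# The encoded form from the text contains no '/', yet it hides a path.
disguised = b'L2V0Yy9wYXNzd2Q='
print(b'/' in disguised)                              # False
print(base64.b64decode(disguised) == b'/etc/passwd')  # True

print(checked_name(b'notes.txt', 'utf-8'))            # notes.txt

try:
    checked_name(b'/etc/passwd', 'ascii')
    rejected = False
except ValueError:
    rejected = True
print(rejected)                                       # True
```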
|
||||
|
||||
References
|
||||
----------
|
||||
|
||||
The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
|
||||
Applications in Python" are available at
|
||||
<https://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
|
||||
and discuss questions of character encodings as well as how to internationalize
|
||||
and localize an application.
|
||||
|
||||
|
||||
Revision History and Acknowledgements
|
||||
=====================================
|
||||
|
||||
Thanks to the following people who have noted errors or offered suggestions on
|
||||
this article: Nicholas Bastin, Marius Gedminas, Kent Johnson, Ken Krugler,
|
||||
Marc-André Lemburg, Martin von Löwis, Chad Whitacre.
|
||||
|
||||
Version 1.0: posted August 5 2005.
|
||||
|
||||
Version 1.01: posted August 7 2005. Corrects factual and markup errors; adds
|
||||
several links.
|
||||
|
||||
Version 1.02: posted August 16 2005. Corrects factual errors.
|
||||
|
||||
Version 1.03: posted June 20 2010. Notes that Python 3.x is not covered,
|
||||
and that the HOWTO only covers 2.x.
|
||||
|
||||
|
||||
.. comment Describe Python 3.x support (new section? new document?)
|
||||
.. comment Additional topic: building Python w/ UCS2 or UCS4 support
|
||||
.. comment Describe obscure -U switch somewhere?
|
||||
.. comment Describe use of codecs.StreamRecoder and StreamReaderWriter
|
||||
|
||||
.. comment
|
||||
Original outline:
|
||||
|
||||
- [ ] Unicode introduction
|
||||
- [ ] ASCII
|
||||
- [ ] Terms
|
||||
- [ ] Character
|
||||
- [ ] Code point
|
||||
- [ ] Encodings
|
||||
- [ ] Common encodings: ASCII, Latin-1, UTF-8
|
||||
- [ ] Unicode Python type
|
||||
- [ ] Writing unicode literals
|
||||
- [ ] Obscurity: -U switch
|
||||
- [ ] Built-ins
|
||||
- [ ] unichr()
|
||||
- [ ] ord()
|
||||
- [ ] unicode() constructor
|
||||
- [ ] Unicode type
|
||||
- [ ] encode(), decode() methods
|
||||
- [ ] Unicodedata module for character properties
|
||||
- [ ] I/O
|
||||
- [ ] Reading/writing Unicode data into files
|
||||
- [ ] Byte-order marks
|
||||
- [ ] Unicode filenames
|
||||
- [ ] Writing Unicode programs
|
||||
- [ ] Do everything in Unicode
|
||||
- [ ] Declaring source code encodings (PEP 263)
|
||||
- [ ] Other issues
|
||||
- [ ] Building Python (UCS2, UCS4)
|
||||
584
Doc/howto/urllib2.rst
Normal file
@@ -0,0 +1,584 @@
|
||||
.. _urllib-howto:
|
||||
|
||||
************************************************
|
||||
HOWTO Fetch Internet Resources Using urllib2
|
||||
************************************************
|
||||
|
||||
:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_
|
||||
|
||||
.. note::
|
||||
|
||||
There is a French translation of an earlier revision of this
|
||||
HOWTO, available at `urllib2 - Le Manuel manquant
|
||||
<http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.
|
||||
|
||||
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
.. sidebar:: Related Articles
|
||||
|
||||
You may also find useful the following article on fetching web resources
|
||||
with Python:
|
||||
|
||||
* `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_
|
||||
|
||||
A tutorial on *Basic Authentication*, with examples in Python.
|
||||
|
||||
**urllib2** is a Python module for fetching URLs
|
||||
(Uniform Resource Locators). It offers a very simple interface, in the form of
|
||||
the *urlopen* function. This is capable of fetching URLs using a variety of
|
||||
different protocols. It also offers a slightly more complex interface for
|
||||
handling common situations - like basic authentication, cookies, proxies and so
|
||||
on. These are provided by objects called handlers and openers.
|
||||
|
||||
urllib2 supports fetching URLs for many "URL schemes" (identified by the string
|
||||
before the ``":"`` in the URL - for example ``"ftp"`` is the URL scheme of
|
||||
``"ftp://python.org/"``) using their associated network protocols (e.g. FTP, HTTP).
|
||||
This tutorial focuses on the most common case, HTTP.
|
||||
|
||||
For straightforward situations *urlopen* is very easy to use. But as soon as you
|
||||
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
|
||||
understanding of the HyperText Transfer Protocol. The most comprehensive and
|
||||
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
|
||||
not intended to be easy to read. This HOWTO aims to illustrate using *urllib2*,
|
||||
with enough detail about HTTP to help you through. It is not intended to replace
|
||||
the :mod:`urllib2` docs, but is supplementary to them.
|
||||
|
||||
|
||||
Fetching URLs
|
||||
=============
|
||||
|
||||
The simplest way to use urllib2 is as follows::
|
||||
|
||||
import urllib2
|
||||
response = urllib2.urlopen('http://python.org/')
|
||||
html = response.read()
|
||||
|
||||
Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we
|
||||
could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the
|
||||
purpose of this tutorial to explain the more complicated cases, concentrating on
|
||||
HTTP.
|
||||
|
||||
HTTP is based on requests and responses - the client makes requests and servers
|
||||
send responses. urllib2 mirrors this with a ``Request`` object which represents
|
||||
the HTTP request you are making. In its simplest form you create a Request
|
||||
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
|
||||
Request object returns a response object for the URL requested. This response is
|
||||
a file-like object, which means you can for example call ``.read()`` on the
|
||||
response::
|
||||
|
||||
import urllib2
|
||||
|
||||
req = urllib2.Request('http://www.voidspace.org.uk')
|
||||
response = urllib2.urlopen(req)
|
||||
the_page = response.read()
|
||||
|
||||
Note that urllib2 makes use of the same Request interface to handle all URL
|
||||
schemes. For example, you can make an FTP request like so::
|
||||
|
||||
req = urllib2.Request('ftp://example.com/')
|
||||
|
||||
In the case of HTTP, there are two extra things that Request objects allow you
|
||||
to do: First, you can pass data to be sent to the server. Second, you can pass
|
||||
extra information ("metadata") *about* the data or about the request itself, to
|
||||
the server - this information is sent as HTTP "headers". Let's look at each of
|
||||
these in turn.
|
||||
|
||||
Data
|
||||
----
|
||||
|
||||
Sometimes you want to send data to a URL (often the URL will refer to a CGI
|
||||
(Common Gateway Interface) script [#]_ or other web application). With HTTP,
|
||||
this is often done using what's known as a **POST** request. This is often what
|
||||
your browser does when you submit an HTML form that you filled in on the web. Not
|
||||
all POSTs have to come from forms: you can use a POST to transmit arbitrary data
|
||||
to your own application. In the common case of HTML forms, the data needs to be
|
||||
encoded in a standard way, and then passed to the Request object as the ``data``
|
||||
argument. The encoding is done using a function from the ``urllib`` library,
|
||||
*not* from ``urllib2``. ::
|
||||
|
||||
import urllib
|
||||
import urllib2
|
||||
|
||||
url = 'http://www.someserver.com/cgi-bin/register.cgi'
|
||||
values = {'name' : 'Michael Foord',
|
||||
'location' : 'Northampton',
|
||||
'language' : 'Python' }
|
||||
|
||||
data = urllib.urlencode(values)
|
||||
req = urllib2.Request(url, data)
|
||||
response = urllib2.urlopen(req)
|
||||
the_page = response.read()
|
||||
|
||||
Note that other encodings are sometimes required (e.g. for file upload from HTML
|
||||
forms - see `HTML Specification, Form Submission
|
||||
<https://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
|
||||
details).
|
||||
|
||||
If you do not pass the ``data`` argument, urllib2 uses a **GET** request. One
|
||||
way in which GET and POST requests differ is that POST requests often have
|
||||
"side-effects": they change the state of the system in some way (for example by
|
||||
placing an order with the website for a hundredweight of tinned spam to be
|
||||
delivered to your door). Though the HTTP standard makes it clear that POSTs are
|
||||
intended to *always* cause side-effects, and GET requests *never* to cause
|
||||
side-effects, nothing prevents a GET request from having side-effects, nor a
|
||||
POST request from having no side-effects. Data can also be passed in an HTTP
|
||||
GET request by encoding it in the URL itself.
|
||||
|
||||
This is done as follows::
|
||||
|
||||
>>> import urllib2
|
||||
>>> import urllib
|
||||
>>> data = {}
|
||||
>>> data['name'] = 'Somebody Here'
|
||||
>>> data['location'] = 'Northampton'
|
||||
>>> data['language'] = 'Python'
|
||||
>>> url_values = urllib.urlencode(data)
|
||||
>>> print url_values # The order may differ. #doctest: +SKIP
|
||||
name=Somebody+Here&language=Python&location=Northampton
|
||||
>>> url = 'http://www.example.com/example.cgi'
|
||||
>>> full_url = url + '?' + url_values
|
||||
>>> data = urllib2.urlopen(full_url)
|
||||
|
||||
Notice that the full URL is created by adding a ``?`` to the URL, followed by
|
||||
the encoded values.
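
The same steps can be wrapped in a small helper. This is only a sketch: the
function name ``build_get_url`` is hypothetical, and the ``try``/``except``
import is an assumption added so the lines also run on Python 3, where
``urlencode`` lives in ``urllib.parse``.

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3 location of the same function


def build_get_url(base_url, params):
    """Return base_url with params appended as an encoded query string."""
    return base_url + '?' + urlencode(params)


# A list of pairs, unlike a dictionary, keeps the parameter order fixed.
full_url = build_get_url('http://www.example.com/example.cgi',
                         [('name', 'Somebody Here'), ('language', 'Python')])
print(full_url)
# http://www.example.com/example.cgi?name=Somebody+Here&language=Python
```

Passing a list of pairs rather than a dictionary is a design choice that makes
the resulting URL predictable, which helps when comparing URLs in tests.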
|
||||
|
||||
Headers
|
||||
-------
|
||||
|
||||
We'll discuss here one particular HTTP header, to illustrate how to add headers
|
||||
to your HTTP request.
|
||||
|
||||
Some websites [#]_ dislike being browsed by programs, or send different versions
|
||||
to different browsers [#]_. By default urllib2 identifies itself as
|
||||
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
|
||||
numbers of the Python release,
|
||||
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
|
||||
not work. The way a browser identifies itself is through the
|
||||
``User-Agent`` header [#]_. When you create a Request object you can
|
||||
pass a dictionary of headers in. The following example makes the same
|
||||
request as above, but identifies itself as a version of Internet
|
||||
Explorer [#]_. ::
|
||||
|
||||
import urllib
|
||||
import urllib2
|
||||
|
||||
url = 'http://www.someserver.com/cgi-bin/register.cgi'
|
||||
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
|
||||
values = {'name': 'Michael Foord',
|
||||
'location': 'Northampton',
|
||||
'language': 'Python' }
|
||||
headers = {'User-Agent': user_agent}
|
||||
|
||||
data = urllib.urlencode(values)
|
||||
req = urllib2.Request(url, data, headers)
|
||||
response = urllib2.urlopen(req)
|
||||
the_page = response.read()
|
||||
|
||||
The response also has two useful methods. See the section on `info and geturl`_
|
||||
which comes after we have a look at what happens when things go wrong.
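
The effect of the ``data`` and ``headers`` arguments can be inspected on the
Request object itself, without contacting any server. The sketch below assumes
a placeholder URL and a made-up ``User-Agent`` value; the fallback import is
only there so the lines also run on Python 3.

```python
try:
    from urllib import urlencode                  # Python 2
    from urllib2 import Request
except ImportError:
    from urllib.parse import urlencode            # Python 3
    from urllib.request import Request

url = 'http://www.example.com/cgi-bin/register.cgi'   # placeholder URL
# Encoding to ASCII bytes keeps the body valid on Python 3 as well.
data = urlencode([('name', 'Michael Foord')]).encode('ascii')

get_req = Request(url)
post_req = Request(url, data, {'User-Agent': 'ExampleClient/1.0'})

print(get_req.get_method())                # GET
print(post_req.get_method())               # POST
# Header names are stored capitalized, so look the header up as 'User-agent'.
print(post_req.get_header('User-agent'))   # ExampleClient/1.0
```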
|
||||
|
||||
|
||||
Handling Exceptions
|
||||
===================
|
||||
|
||||
*urlopen* raises :exc:`URLError` when it cannot handle a response (though as
|
||||
usual with Python APIs, built-in exceptions such as :exc:`ValueError`,
|
||||
:exc:`TypeError` etc. may also be raised).
|
||||
|
||||
:exc:`HTTPError` is the subclass of :exc:`URLError` raised in the specific case of
|
||||
HTTP URLs.
|
||||
|
||||
URLError
|
||||
--------
|
||||
|
||||
Often, URLError is raised because there is no network connection (no route to
|
||||
the specified server), or the specified server doesn't exist. In this case, the
|
||||
exception raised will have a 'reason' attribute, which is a tuple containing an
|
||||
error code and a text error message.
|
||||
|
||||
e.g. ::
|
||||
|
||||
    >>> req = urllib2.Request('http://www.pretend_server.org')
    >>> try: urllib2.urlopen(req)
    ... except urllib2.URLError as e:
    ...     print e.reason      #doctest: +SKIP
    ...
    (4, 'getaddrinfo failed')
|
||||
|
||||
|
||||
HTTPError
|
||||
---------
|
||||
|
||||
Every HTTP response from the server contains a numeric "status code". Sometimes
|
||||
the status code indicates that the server is unable to fulfil the request. The
|
||||
default handlers will handle some of these responses for you (for example, if
|
||||
the response is a "redirection" that requests the client fetch the document from
|
||||
a different URL, urllib2 will handle that for you). For those it can't handle,
|
||||
urlopen will raise an :exc:`HTTPError`. Typical errors include '404' (page not
|
||||
found), '403' (request forbidden), and '401' (authentication required).
|
||||
|
||||
See section 10 of RFC 2616 for a reference on all the HTTP error codes.
|
||||
|
||||
The :exc:`HTTPError` instance raised will have an integer 'code' attribute, which
|
||||
corresponds to the error sent by the server.
|
||||
|
||||
Error Codes
|
||||
~~~~~~~~~~~
|
||||
|
||||
Because the default handlers handle redirects (codes in the 300 range), and
|
||||
codes in the 100--299 range indicate success, you will usually only see error
|
||||
codes in the 400--599 range.
|
||||
|
||||
``BaseHTTPServer.BaseHTTPRequestHandler.responses`` is a useful dictionary of
|
||||
response codes that shows all the response codes used by RFC 2616. The
|
||||
dictionary is reproduced here for convenience ::
|
||||
|
||||
# Table mapping response codes to messages; entries have the
|
||||
# form {code: (shortmessage, longmessage)}.
|
||||
responses = {
|
||||
100: ('Continue', 'Request received, please continue'),
|
||||
101: ('Switching Protocols',
|
||||
'Switching to new protocol; obey Upgrade header'),
|
||||
|
||||
200: ('OK', 'Request fulfilled, document follows'),
|
||||
201: ('Created', 'Document created, URL follows'),
|
||||
202: ('Accepted',
|
||||
'Request accepted, processing continues off-line'),
|
||||
203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
|
||||
204: ('No Content', 'Request fulfilled, nothing follows'),
|
||||
205: ('Reset Content', 'Clear input form for further input.'),
|
||||
206: ('Partial Content', 'Partial content follows.'),
|
||||
|
||||
300: ('Multiple Choices',
|
||||
'Object has several resources -- see URI list'),
|
||||
301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
|
||||
302: ('Found', 'Object moved temporarily -- see URI list'),
|
||||
303: ('See Other', 'Object moved -- see Method and URL list'),
|
||||
304: ('Not Modified',
|
||||
'Document has not changed since given time'),
|
||||
305: ('Use Proxy',
|
||||
'You must use proxy specified in Location to access this '
|
||||
'resource.'),
|
||||
307: ('Temporary Redirect',
|
||||
'Object moved temporarily -- see URI list'),
|
||||
|
||||
400: ('Bad Request',
|
||||
'Bad request syntax or unsupported method'),
|
||||
401: ('Unauthorized',
|
||||
'No permission -- see authorization schemes'),
|
||||
402: ('Payment Required',
|
||||
'No payment -- see charging schemes'),
|
||||
403: ('Forbidden',
|
||||
'Request forbidden -- authorization will not help'),
|
||||
404: ('Not Found', 'Nothing matches the given URI'),
|
||||
405: ('Method Not Allowed',
|
||||
'Specified method is invalid for this server.'),
|
||||
406: ('Not Acceptable', 'URI not available in preferred format.'),
|
||||
407: ('Proxy Authentication Required', 'You must authenticate with '
|
||||
'this proxy before proceeding.'),
|
||||
408: ('Request Timeout', 'Request timed out; try again later.'),
|
||||
409: ('Conflict', 'Request conflict.'),
|
||||
410: ('Gone',
|
||||
'URI no longer exists and has been permanently removed.'),
|
||||
411: ('Length Required', 'Client must specify Content-Length.'),
|
||||
412: ('Precondition Failed', 'Precondition in headers is false.'),
|
||||
413: ('Request Entity Too Large', 'Entity is too large.'),
|
||||
414: ('Request-URI Too Long', 'URI is too long.'),
|
||||
415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
|
||||
416: ('Requested Range Not Satisfiable',
|
||||
'Cannot satisfy request range.'),
|
||||
417: ('Expectation Failed',
|
||||
'Expect condition could not be satisfied.'),
|
||||
|
||||
500: ('Internal Server Error', 'Server got itself in trouble'),
|
||||
501: ('Not Implemented',
|
||||
'Server does not support this operation'),
|
||||
502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
|
||||
503: ('Service Unavailable',
|
||||
'The server cannot process the request due to a high load'),
|
||||
504: ('Gateway Timeout',
|
||||
'The gateway server did not receive a timely response'),
|
||||
505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
|
||||
}
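
Looking a code up in this table is straightforward. The following sketch uses a
hypothetical helper, ``describe_status``, and falls back to ``http.server`` (the
Python 3 home of the same class) when ``BaseHTTPServer`` is not available.

```python
try:
    from BaseHTTPServer import BaseHTTPRequestHandler   # Python 2
except ImportError:
    from http.server import BaseHTTPRequestHandler      # Python 3


def describe_status(code):
    """Return 'code short-message', or a fallback for codes not in the table."""
    short, _long = BaseHTTPRequestHandler.responses.get(code, ('Unknown', ''))
    return '%d %s' % (code, short)


print(describe_status(404))   # 404 Not Found
print(describe_status(299))   # 299 Unknown
```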
|
||||
|
||||
When an error is raised the server responds by returning an HTTP error code
|
||||
*and* an error page. You can use the :exc:`HTTPError` instance as the response object for the
page returned. This means that, as well as the ``code`` attribute, it also has ``read``,
``geturl``, and ``info`` methods. ::
|
||||
|
||||
    >>> req = urllib2.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib2.urlopen(req)
    ... except urllib2.HTTPError as e:
    ...     print e.code
    ...     print e.read() #doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ...
    404
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    ...
    <title>Page Not Found</title>
    ...
|
||||
|
||||
|
||||
Wrapping it Up
|
||||
--------------
|
||||
|
||||
So if you want to be prepared for :exc:`HTTPError` *or* :exc:`URLError` there are two
|
||||
basic approaches. I prefer the second approach.
|
||||
|
||||
Number 1
|
||||
~~~~~~~~
|
||||
|
||||
::
|
||||
|
||||
|
||||
    from urllib2 import Request, urlopen, URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError as e:
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    except URLError as e:
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    else:
        pass  # everything is fine
|
||||
|
||||
|
||||
.. note::
|
||||
|
||||
The ``except HTTPError`` *must* come first, otherwise ``except URLError``
|
||||
will *also* catch an :exc:`HTTPError`.
|
||||
|
||||
Number 2
|
||||
~~~~~~~~
|
||||
|
||||
::
|
||||
|
||||
    from urllib2 import Request, urlopen, URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError as e:
        # Test for 'code' first: an HTTPError also provides a 'reason' attribute.
        if hasattr(e, 'code'):
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
        elif hasattr(e, 'reason'):
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
    else:
        pass  # everything is fine
|
||||
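
Both approaches rely on :exc:`HTTPError` being a subclass of :exc:`URLError`,
so the more specific case has to be tested first. The sketch below illustrates
this with exception instances built by hand, so no network access is needed.
The helper name ``describe_failure`` and the example URL are hypothetical, and
the fallback import is only there so the lines also run on Python 3.

```python
try:
    from urllib2 import URLError, HTTPError        # Python 2
except ImportError:
    from urllib.error import URLError, HTTPError   # Python 3


def describe_failure(exc):
    """Summarize an exception raised by urlopen, most specific class first."""
    if isinstance(exc, HTTPError):
        return 'server error %d' % exc.code
    if isinstance(exc, URLError):
        return 'unreachable: %s' % (exc.reason,)
    raise TypeError('not a urllib2 error: %r' % (exc,))


not_found = HTTPError('http://www.example.com/', 404, 'Not Found', {}, None)
print(issubclass(HTTPError, URLError))                  # True
print(describe_failure(not_found))                      # server error 404
print(describe_failure(URLError('no route to host')))   # unreachable: no route to host
```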
|
||||
|
||||
info and geturl
|
||||
===============
|
||||
|
||||
The response returned by urlopen (or the :exc:`HTTPError` instance) has two useful
|
||||
methods :meth:`info` and :meth:`geturl`.
|
||||
|
||||
**geturl** - this returns the real URL of the page fetched. This is useful
|
||||
because ``urlopen`` (or the opener object used) may have followed a
|
||||
redirect. The URL of the page fetched may not be the same as the URL requested.
|
||||
|
||||
**info** - this returns a dictionary-like object that describes the page
|
||||
fetched, particularly the headers sent by the server. It is currently an
|
||||
``httplib.HTTPMessage`` instance.
|
||||
|
||||
Typical headers include 'Content-length', 'Content-type', and so on. See the
|
||||
`Quick Reference to HTTP Headers <https://www.cs.tut.fi/~jkorpela/http.html>`_
|
||||
for a useful listing of HTTP headers with brief explanations of their meaning
|
||||
and use.
|
||||
|
||||
|
||||
Openers and Handlers
|
||||
====================
|
||||
|
||||
When you fetch a URL you use an opener (an instance of the perhaps
|
||||
confusingly-named :class:`urllib2.OpenerDirector`). Normally we have been using
|
||||
the default opener - via ``urlopen`` - but you can create custom
|
||||
openers. Openers use handlers. All the "heavy lifting" is done by the
|
||||
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
|
||||
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
|
||||
redirections or HTTP cookies.
|
||||
|
||||
You will want to create openers if you want to fetch URLs with specific handlers
|
||||
installed, for example to get an opener that handles cookies, or to get an
|
||||
opener that does not handle redirections.
|
||||
|
||||
To create an opener, instantiate an ``OpenerDirector``, and then call
|
||||
``.add_handler(some_handler_instance)`` repeatedly.
|
||||
|
||||
Alternatively, you can use ``build_opener``, which is a convenience function for
|
||||
creating opener objects with a single function call. ``build_opener`` adds
|
||||
several handlers by default, but provides a quick way to add more and/or
|
||||
override the default handlers.
|
||||
|
||||
Other sorts of handlers you might want can handle proxies, authentication,
|
||||
and other common but slightly specialised situations.
|
||||
|
||||
``install_opener`` can be used to make an ``opener`` object the (global) default
|
||||
opener. This means that calls to ``urlopen`` will use the opener you have
|
||||
installed.
|
||||
|
||||
Opener objects have an ``open`` method, which can be called directly to fetch
|
||||
URLs in the same way as the ``urlopen`` function: there's no need to call
|
||||
``install_opener``, except as a convenience.
|
||||
|
||||
|
||||
Basic Authentication
|
||||
====================
|
||||
|
||||
To illustrate creating and installing a handler we will use the
|
||||
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject --
|
||||
including an explanation of how Basic Authentication works -- see the `Basic
|
||||
Authentication Tutorial
|
||||
<http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.
|
||||
|
||||
When authentication is required, the server sends a header (as well as the 401
|
||||
error code) requesting authentication. This specifies the authentication scheme
|
||||
and a 'realm'. The header looks like: ``WWW-Authenticate: SCHEME
|
||||
realm="REALM"``.
|
||||
|
||||
e.g. ::
|
||||
|
||||
WWW-Authenticate: Basic realm="cPanel Users"
|
||||
|
||||
|
||||
The client should then retry the request with the appropriate name and password
|
||||
for the realm included as a header in the request. This is 'basic
|
||||
authentication'. In order to simplify this process we can create an instance of
|
||||
``HTTPBasicAuthHandler`` and an opener to use this handler.
|
||||
|
||||
The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
|
||||
the mapping of URLs and realms to passwords and usernames. If you know what the
|
||||
realm is (from the authentication header sent by the server), then you can use a
|
||||
``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
|
||||
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
|
||||
you to specify a default username and password for a URL. This will be supplied
|
||||
in the absence of you providing an alternative combination for a specific
|
||||
realm. We indicate this by providing ``None`` as the realm argument to the
|
||||
``add_password`` method.
|
||||
|
||||
The top-level URL is the first URL that requires authentication. URLs "deeper"
|
||||
than the URL you pass to ``.add_password()`` will also match. ::
|
||||
|
||||
# create a password manager
|
||||
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
|
||||
|
||||
# Add the username and password.
|
||||
# If we knew the realm, we could use it instead of None.
|
||||
top_level_url = "http://example.com/foo/"
|
||||
password_mgr.add_password(None, top_level_url, username, password)
|
||||
|
||||
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
|
||||
|
||||
# create "opener" (OpenerDirector instance)
|
||||
opener = urllib2.build_opener(handler)
|
||||
|
||||
# use the opener to fetch a URL
|
||||
opener.open(a_url)
|
||||
|
||||
# Install the opener.
|
||||
# Now all calls to urllib2.urlopen use our opener.
|
||||
urllib2.install_opener(opener)
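
How the password manager matches "deeper" URLs can be checked directly, without
a server. This is a sketch under stated assumptions: the credentials and realm
name are placeholders, and the fallback import is only there so the lines also
run on Python 3.

```python
try:
    from urllib2 import HTTPPasswordMgrWithDefaultRealm          # Python 2
except ImportError:
    from urllib.request import HTTPPasswordMgrWithDefaultRealm   # Python 3

password_mgr = HTTPPasswordMgrWithDefaultRealm()
# Placeholder credentials, registered for the default realm (None).
password_mgr.add_password(None, 'http://example.com/foo/', 'joe', 'secret')

# A "deeper" URL matches, whatever realm the server announces.
deeper = password_mgr.find_user_password('Some Realm',
                                         'http://example.com/foo/bar.html')
# A URL outside the registered prefix does not match.
outside = password_mgr.find_user_password('Some Realm',
                                          'http://example.com/other/')
print(deeper)    # ('joe', 'secret')
print(outside)   # (None, None)
```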
|
||||
|
||||
.. note::
|
||||
|
||||
In the above example we only supplied our ``HTTPBasicAuthHandler`` to
|
||||
``build_opener``. By default openers have the handlers for normal situations
|
||||
-- ``ProxyHandler`` (if a proxy setting such as an :envvar:`http_proxy`
|
||||
environment variable is set), ``UnknownHandler``, ``HTTPHandler``,
|
||||
``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
|
||||
``FileHandler``, ``HTTPErrorProcessor``.
|
||||
|
||||
``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
|
||||
component and the hostname and optionally the port number)
|
||||
e.g. ``"http://example.com/"`` *or* an "authority" (i.e. the hostname,
|
||||
optionally including the port number) e.g. ``"example.com"`` or ``"example.com:8080"``
|
||||
(the latter example includes a port number). The authority, if present, must
|
||||
NOT contain the "userinfo" component - for example ``"joe:password@example.com"`` is
|
||||
not correct.
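
To see which handlers an opener actually ends up with, you can list them. The
sketch below relies on the ``handlers`` list of the opener object, an attribute
that exists in CPython but is not described in this HOWTO, so treat it as an
inspection aid rather than a supported interface. The exact list depends on the
Python version and on the environment (for example, whether proxy settings are
detected).

```python
try:
    import urllib2                       # Python 2
except ImportError:
    import urllib.request as urllib2     # Python 3 home of the same names

auth_handler = urllib2.HTTPBasicAuthHandler(
    urllib2.HTTPPasswordMgrWithDefaultRealm())
opener = urllib2.build_opener(auth_handler)

# Collect the class name of every handler installed in the opener.
handler_names = sorted(type(h).__name__ for h in opener.handlers)
print(handler_names)
```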
|
||||
|
||||
|
||||
Proxies
|
||||
=======
|
||||
|
||||
**urllib2** will auto-detect your proxy settings and use those. This is through
|
||||
the ``ProxyHandler``, which is part of the normal handler chain when a proxy
|
||||
setting is detected. Normally that's a good thing, but there are occasions
|
||||
when it may not be helpful [#]_. One way to bypass it is to set up our own
|
||||
``ProxyHandler``, with no proxies defined. This is done using similar steps to
|
||||
setting up a `Basic Authentication`_ handler: ::
|
||||
|
||||
>>> proxy_support = urllib2.ProxyHandler({})
|
||||
>>> opener = urllib2.build_opener(proxy_support)
|
||||
>>> urllib2.install_opener(opener)
|
||||
|
||||
.. note::
|
||||
|
||||
Currently ``urllib2`` *does not* support fetching of ``https`` locations
|
||||
through a proxy. However, this can be enabled by extending urllib2 as
|
||||
shown in the recipe [#]_.
|
||||
|
||||
.. note::
|
||||
|
||||
``HTTP_PROXY`` will be ignored if a variable ``REQUEST_METHOD`` is set; see
|
||||
the documentation on :func:`~urllib.getproxies`.
|
||||
|
||||
|
||||
Sockets and Layers
|
||||
==================
|
||||
|
||||
The Python support for fetching resources from the web is layered. urllib2 uses
|
||||
the httplib library, which in turn uses the socket library.
|
||||
|
||||
As of Python 2.3 you can specify how long a socket should wait for a response
|
||||
before timing out. This can be useful in applications which have to fetch web
|
||||
pages. By default the socket module has *no timeout* and can hang. Since Python 2.6,
``urllib2.urlopen`` also accepts an optional ``timeout`` argument for a single call. Alternatively,
you can set the default timeout globally for all sockets using ::
|
||||
|
||||
import socket
|
||||
import urllib2
|
||||
|
||||
# timeout in seconds
|
||||
timeout = 10
|
||||
socket.setdefaulttimeout(timeout)
|
||||
|
||||
# this call to urllib2.urlopen now uses the default timeout
|
||||
# we have set in the socket module
|
||||
req = urllib2.Request('http://www.voidspace.org.uk')
|
||||
response = urllib2.urlopen(req)
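
Because ``socket.setdefaulttimeout`` changes a process-wide setting, it can be
worth restoring the previous value afterwards. A minimal sketch, with no network
access involved and variable names chosen only for illustration:

```python
import socket

previous = socket.getdefaulttimeout()    # None means "no timeout"
socket.setdefaulttimeout(10)             # timeout in seconds
current = socket.getdefaulttimeout()
print(current)                           # 10.0

# ... fetch URLs here ...

# Put the earlier setting back so other code is not affected.
socket.setdefaulttimeout(previous)
```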
|
||||
|
||||
|
||||
-------
|
||||
|
||||
|
||||
Footnotes
|
||||
=========
|
||||
|
||||
This document was reviewed and revised by John Lee.
|
||||
|
||||
.. [#] For an introduction to the CGI protocol see
|
||||
`Writing Web Applications in Python <http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html>`_.
|
||||
.. [#] Google for example.
|
||||
.. [#] Browser sniffing is a very bad practice for website design - building
|
||||
sites using web standards is much more sensible. Unfortunately a lot of
|
||||
sites still send different versions to different browsers.
|
||||
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
|
||||
.. [#] In my case I have to use a proxy to access the internet at work. If you
|
||||
attempt to fetch *localhost* URLs through this proxy it blocks them. IE
|
||||
is set to use the proxy, which urllib2 picks up on. In order to test
|
||||
scripts with a localhost server, I have to prevent urllib2 from using
|
||||
the proxy.
|
||||
.. [#] urllib2 opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
|
||||
<https://code.activestate.com/recipes/456195/>`_.
|
||||
|
||||
735
Doc/howto/webservers.rst
Normal file
@@ -0,0 +1,735 @@
|
||||
*******************************
|
||||
HOWTO Use Python in the web
|
||||
*******************************
|
||||
|
||||
:Author: Marek Kubica
|
||||
|
||||
.. topic:: Abstract
|
||||
|
||||
This document shows how Python fits into the web. It presents some ways
|
||||
to integrate Python with a web server, and general practices useful for
|
||||
developing web sites.
|
||||
|
||||
|
||||
Programming for the Web has become a hot topic since the rise of "Web 2.0",
|
||||
which focuses on user-generated content on web sites. It has always been
|
||||
possible to use Python for creating web sites, but it was a rather tedious task.
|
||||
Therefore, many frameworks and helper tools have been created to assist
|
||||
developers in creating faster and more robust sites. This HOWTO describes
|
||||
some of the methods used to combine Python with a web server to create
|
||||
dynamic content. It is not meant as a complete introduction, as this topic is
|
||||
far too broad to be covered in one single document. However, a short overview
|
||||
of the most popular libraries is provided.
|
||||
|
||||
.. seealso::
|
||||
|
||||
While this HOWTO tries to give an overview of Python in the web, it cannot
|
||||
always be as up to date as desired. Web development in Python is rapidly
|
||||
moving forward, so the wiki page on `Web Programming
|
||||
<https://wiki.python.org/moin/WebProgramming>`_ may be more in sync with
|
||||
recent development.
|
||||
|
||||
|
||||
The Low-Level View
|
||||
==================
|
||||
|
||||
When a user enters a web site, their browser makes a connection to the site's
|
||||
web server (this is called the *request*). The server looks up the file in the
|
||||
file system and sends it back to the user's browser, which displays it (this is
|
||||
the *response*). This is roughly how the underlying protocol, HTTP, works.
|
||||
|
||||
Dynamic web sites are not based on files in the file system, but rather on
|
||||
programs which are run by the web server when a request comes in, and which
|
||||
*generate* the content that is returned to the user. They can do all sorts of
|
||||
useful things, like display the postings of a bulletin board, show your email,
|
||||
configure software, or just display the current time. These programs can be
|
||||
written in any programming language the server supports. Since most servers
|
||||
support Python, it is easy to use Python to create dynamic web sites.
|
||||
|
||||
Most HTTP servers are written in C or C++, so they cannot execute Python code
|
||||
directly -- a bridge is needed between the server and the program. These
|
||||
bridges, or rather interfaces, define how programs interact with the server.
|
||||
There have been numerous attempts to create the best possible interface, but
|
||||
there are only a few worth mentioning.
|
||||
|
||||
Not every web server supports every interface. Many web servers only support
|
||||
old, now-obsolete interfaces; however, they can often be extended using
|
||||
third-party modules to support newer ones.
|
||||
|
||||
|
||||
Common Gateway Interface
|
||||
------------------------
|
||||
|
||||
This interface, most commonly referred to as "CGI", is the oldest, and is
|
||||
supported by nearly every web server out of the box. Programs using CGI to
|
||||
communicate with their web server need to be started by the server for every
|
||||
request. So, every request starts a new Python interpreter -- which takes some
|
||||
time to start up -- thus making the whole interface only usable for low load
|
||||
situations.
|
||||
|
||||
The upside of CGI is that it is simple -- writing a Python program which uses
|
||||
CGI is a matter of about three lines of code. This simplicity comes at a
|
||||
price: it does very few things to help the developer.
|
||||
|
||||
Writing CGI programs, while still possible, is no longer recommended. With
|
||||
:ref:`WSGI <WSGI>`, a topic covered later in this document, it is possible to write
|
||||
programs that emulate CGI, so they can be run as CGI if no better option is
|
||||
available.
|
||||
|
||||
.. seealso::
|
||||
|
||||
The Python standard library includes some modules that are helpful for
|
||||
creating plain CGI programs:
|
||||
|
||||
* :mod:`cgi` -- Handling of user input in CGI scripts
|
||||
* :mod:`cgitb` -- Displays nice tracebacks when errors happen in CGI
|
||||
applications, instead of presenting a "500 Internal Server Error" message
|
||||
|
||||
The Python wiki features a page on `CGI scripts
|
||||
<https://wiki.python.org/moin/CgiScripts>`_ with some additional information
|
||||
about CGI in Python.
|
||||
|
||||
|
||||
Simple script for testing CGI
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
To test whether your web server works with CGI, you can use this short and
|
||||
simple CGI program::
|
||||
|
||||
#!/usr/bin/env python
|
||||
# -*- coding: UTF-8 -*-
|
||||
|
||||
# enable debugging
|
||||
import cgitb
|
||||
cgitb.enable()
|
||||
|
||||
print "Content-Type: text/plain;charset=utf-8"
|
||||
print
|
||||
|
||||
print "Hello World!"
|
||||
|
||||
Depending on your web server configuration, you may need to save this code with
|
||||
a ``.py`` or ``.cgi`` extension. Additionally, this file may also need to be
|
||||
in a ``cgi-bin`` folder, for security reasons.
|
||||
|
||||
You might wonder what the ``cgitb`` line is about. This line makes it possible
|
||||
to display a nice traceback instead of just crashing and displaying an "Internal
|
||||
Server Error" in the user's browser. This is useful for debugging, but it might
|
||||
risk exposing some confidential data to the user. You should not use ``cgitb``
|
||||
in production code for this reason. You should *always* catch exceptions, and
|
||||
display proper error pages -- end-users don't like to see nondescript "Internal
|
||||
Server Errors" in their browsers.
|
||||
|
||||
|
||||
Setting up CGI on your own server
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
If you don't have your own web server, this does not apply to you. You can
|
||||
check whether it works as-is, and if not you will need to talk to the
|
||||
administrator of your web server. If it is a big host, you can try filing a
|
||||
ticket asking for Python support.
|
||||
|
||||
If you are your own administrator or want to set up CGI for testing purposes on
|
||||
your own computers, you have to configure it by yourself. There is no single
|
||||
way to configure CGI, as there are many web servers with different
|
||||
configuration options. Currently the most widely used free web server is
|
||||
`Apache HTTPd <http://httpd.apache.org/>`_, or Apache for short. Apache can be
|
||||
easily installed on nearly every system using the system's package management
|
||||
tool. `lighttpd <http://www.lighttpd.net>`_ is another alternative and is
|
||||
said to have better performance. On many systems this server can also be
|
||||
installed using the package management tool, so manually compiling the web
|
||||
server may not be needed.
|
||||
|
||||
* On Apache you can take a look at the `Dynamic Content with CGI
|
||||
<http://httpd.apache.org/docs/2.2/howto/cgi.html>`_ tutorial, where everything
|
||||
is described. Most of the time it is enough just to set ``+ExecCGI``. The
|
||||
tutorial also describes the most common gotchas that might arise.
|
||||
|
||||
* On lighttpd you need to use the `CGI module
|
||||
<http://redmine.lighttpd.net/projects/lighttpd/wiki/Docs_ModCGI>`_\ , which can be configured
|
||||
in a straightforward way. It boils down to setting ``cgi.assign`` properly.
|
||||
|
||||
|
||||
Common problems with CGI scripts
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Using CGI sometimes leads to small annoyances while trying to get these
|
||||
scripts to run. Sometimes a seemingly correct script does not work as
|
||||
expected, the cause being some small hidden problem that's difficult to spot.
|
||||
|
||||
Some of these potential problems are:
|
||||
|
||||
* The Python script is not marked as executable. When a CGI script is not
executable, most web servers will let the user download it instead of
|
||||
running it and sending the output to the user. For CGI scripts to run
|
||||
properly on Unix-like operating systems, the ``+x`` bit needs to be set.
|
||||
Using ``chmod a+x your_script.py`` may solve this problem.
|
||||
|
||||
* On a Unix-like system, the line endings in the program file must be Unix
|
||||
style line endings. This is important because the web server checks the
|
||||
first line of the script (called shebang) and tries to run the program
|
||||
specified there. It gets easily confused by Windows line endings (Carriage
|
||||
Return & Line Feed, also called CRLF), so you have to convert the file to
|
||||
Unix line endings (only Line Feed, LF). This can be done automatically by
|
||||
uploading the file via FTP in text mode instead of binary mode, but the
|
||||
preferred way is just telling your editor to save the files with Unix line
|
||||
endings. Most editors support this.
|
||||
|
||||
* Your web server must be able to read the file, and you need to make sure the
|
||||
permissions are correct. On Unix-like systems, the server often runs as user
and group ``www-data``, so it might be worth changing the file ownership or
making the file world-readable with ``chmod a+r your_script.py``.
|
||||
|
||||
* The web server must know that the file you're trying to access is a CGI script.
|
||||
Check the configuration of your web server, as it may be configured
|
||||
to expect a specific file extension for CGI scripts.
|
||||
|
||||
* On Unix-like systems, the path to the interpreter in the shebang
|
||||
(``#!/usr/bin/env python``) must be correct. This line calls
|
||||
``/usr/bin/env`` to find Python, but it will fail if there is no
|
||||
``/usr/bin/env``, or if Python is not in the web server's path. If you know
|
||||
where your Python is installed, you can also use that full path. The
|
||||
commands ``whereis python`` and ``type -p python`` could help you find
|
||||
where it is installed. Once you know the path, you can change the shebang
|
||||
accordingly: ``#!/usr/bin/python``.
|
||||
|
||||
* The file must not contain a BOM (Byte Order Mark). The BOM is meant for
|
||||
determining the byte order of UTF-16 and UTF-32 encodings, but some editors
|
||||
write this also into UTF-8 files. The BOM interferes with the shebang line,
|
||||
so be sure to tell your editor not to write the BOM.
|
||||
|
||||
* If the web server is using :ref:`mod-python`, ``mod_python`` may be having
|
||||
problems. ``mod_python`` is able to handle CGI scripts by itself, but it can
|
||||
also be a source of issues.
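Several of the file-level problems in the list above can be checked mechanically. The following is only a sketch: the function name ``check_cgi_script`` and its messages are made up for this example, and the permission checks apply to the user running the check, which is not necessarily the user the web server runs as:

```python
import os

UTF8_BOM = b"\xef\xbb\xbf"


def check_cgi_script(path):
    """Return a list of likely problems with the CGI script at *path*."""
    problems = []
    if not os.access(path, os.R_OK):
        problems.append("file is not readable")
        return problems
    if not os.access(path, os.X_OK):
        problems.append("file is not executable (try chmod a+x)")

    # Read the first line as bytes so that a BOM or CRLF stays visible.
    with open(path, "rb") as f:
        first_line = f.readline()

    if first_line.startswith(UTF8_BOM):
        problems.append("file starts with a UTF-8 byte order mark")
        first_line = first_line[len(UTF8_BOM):]
    if not first_line.startswith(b"#!"):
        problems.append("first line is not a shebang")
    if first_line.endswith(b"\r\n"):
        problems.append("file uses Windows (CRLF) line endings")
    return problems
```

An empty list means none of these particular problems were found; server-side issues such as a missing CGI handler configuration still have to be checked in the web server itself.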
|
||||
|
||||
|
||||
.. _mod-python:
|
||||
|
||||
mod_python
|
||||
----------
|
||||
|
||||
People coming from PHP often find it hard to grasp how to use Python on the web.
|
||||
Their first thought is mostly `mod_python <http://modpython.org/>`_\ ,
|
||||
because they think that this is the equivalent to ``mod_php``. Actually, there
|
||||
are many differences. What ``mod_python`` does is embed the interpreter into
|
||||
the Apache process, thus speeding up requests by not having to start a Python
|
||||
interpreter for each request. On the other hand, it is not "Python intermixed
|
||||
with HTML" in the way that PHP is often intermixed with HTML. The Python
|
||||
equivalent of that is a template engine. ``mod_python`` itself is much more
|
||||
powerful and provides more access to Apache internals. It can emulate CGI,
|
||||
work in a "Python Server Pages" mode (similar to JSP) which is "HTML
|
||||
intermingled with Python", and it has a "Publisher" which designates one file
|
||||
to accept all requests and decide what to do with them.
|
||||
|
||||
``mod_python`` does have some problems. Unlike the PHP interpreter, the Python
|
||||
interpreter uses caching when executing files, so changes to a file will
|
||||
require the web server to be restarted. Another problem is the basic concept
|
||||
-- Apache starts child processes to handle the requests, and unfortunately
|
||||
every child process needs to load the whole Python interpreter even if it does
|
||||
not use it. This makes the whole web server slower. Another problem is that,
|
||||
because ``mod_python`` is linked against a specific version of ``libpython``,
|
||||
it is not possible to switch from an older version to a newer one (e.g. 2.4 to 2.5)
|
||||
without recompiling ``mod_python``. ``mod_python`` is also bound to the Apache
|
||||
web server, so programs written for ``mod_python`` cannot easily run on other
|
||||
web servers.
|
||||
|
||||
These are the reasons why ``mod_python`` should be avoided when writing new
|
||||
programs. In some circumstances it still might be a good idea to use
|
||||
``mod_python`` for deployment, but WSGI makes it possible to run WSGI programs
|
||||
under ``mod_python`` as well.
|
||||
|
||||
|
||||
FastCGI and SCGI
|
||||
----------------
|
||||
|
||||
FastCGI and SCGI try to solve the performance problem of CGI in another way.
|
||||
Instead of embedding the interpreter into the web server, they create
|
||||
long-running background processes. There is still a module in the web server
|
||||
which makes it possible for the web server to "speak" with the background
|
||||
process. As the background process is independent of the server, it can be
|
||||
written in any language, including Python. The language just needs to have a
|
||||
library which handles the communication with the web server.
|
||||
|
||||
The difference between FastCGI and SCGI is very small, as SCGI is essentially
|
||||
just a "simpler FastCGI". As the web server support for SCGI is limited,
|
||||
most people use FastCGI instead, which works the same way. Almost everything
|
||||
that applies to SCGI also applies to FastCGI, so we'll only cover
|
||||
the latter.
|
||||
|
||||
These days, FastCGI is never used directly. Just like ``mod_python``, it is only
|
||||
used for the deployment of WSGI applications.
|
||||
|
||||
|
||||
Setting up FastCGI
|
||||
^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Each web server requires a specific module.
|
||||
|
||||
* Apache has both `mod_fastcgi <http://www.fastcgi.com/drupal/>`_ and `mod_fcgid
|
||||
<https://httpd.apache.org/mod_fcgid/>`_. ``mod_fastcgi`` is the original one, but it
|
||||
has some licensing issues, which is why it is sometimes considered non-free.
|
||||
``mod_fcgid`` is a smaller, compatible alternative. One of these modules needs
|
||||
to be loaded by Apache.
|
||||
|
||||
* lighttpd ships its own `FastCGI module
|
||||
<http://redmine.lighttpd.net/projects/lighttpd/wiki/Docs_ModFastCGI>`_ as well as an
|
||||
`SCGI module <http://redmine.lighttpd.net/projects/lighttpd/wiki/Docs_ModSCGI>`_.
|
||||
|
||||
* `nginx <http://nginx.org/>`_ also supports `FastCGI
|
||||
<https://www.nginx.com/resources/wiki/start/topics/examples/simplepythonfcgi/>`_.
|
||||
|
||||
Once you have installed and configured the module, you can test it with the
|
||||
following WSGI application::
|
||||
|
||||
    #!/usr/bin/env python
    # -*- coding: UTF-8 -*-

    from cgi import escape
    from flup.server.fcgi import WSGIServer

    def app(environ, start_response):
        start_response('200 OK', [('Content-Type', 'text/html')])

        yield '<h1>FastCGI Environment</h1>'
        yield '<table>'
        for k, v in sorted(environ.items()):
            # environ also holds non-string values (e.g. wsgi.version),
            # so convert each value to a string before escaping it
            yield '<tr><th>%s</th><td>%s</td></tr>' % (escape(k), escape(str(v)))
        yield '</table>'

    WSGIServer(app).run()
|
||||
|
||||
This is a simple WSGI application, but you need to install `flup
|
||||
<https://pypi.org/project/flup/1.0>`_ first, as flup handles the low level
|
||||
FastCGI access.
|
||||
|
||||
.. seealso::
|
||||
|
||||
There is some documentation on `setting up Django with WSGI
|
||||
<https://docs.djangoproject.com/en/dev/howto/deployment/wsgi/>`_, most of
|
||||
which can be reused for other WSGI-compliant frameworks and libraries.
|
||||
Only the ``manage.py`` part has to be changed; the example used here can be
|
||||
used instead. Django does more or less the exact same thing.
|
||||
|
||||
|
||||
mod_wsgi
|
||||
--------
|
||||
|
||||
`mod_wsgi <http://code.google.com/p/modwsgi/>`_ is an attempt to get rid of the
|
||||
low level gateways. Given that FastCGI, SCGI, and mod_python are mostly used to
|
||||
deploy WSGI applications, mod_wsgi was started to directly embed WSGI applications
|
||||
into the Apache web server. mod_wsgi is specifically designed to host WSGI
|
||||
applications. It makes the deployment of WSGI applications much easier than
|
||||
deployment using other low level methods, which need glue code. The downside
|
||||
is that mod_wsgi is limited to the Apache web server; other servers would need
|
||||
their own implementations of mod_wsgi.
|
||||
|
||||
mod_wsgi supports two modes: embedded mode, in which it integrates with the
|
||||
Apache process, and daemon mode, which is more FastCGI-like. Unlike FastCGI,
|
||||
mod_wsgi handles the worker-processes by itself, which makes administration
|
||||
easier.
|
||||
|
||||
|
||||
.. _WSGI:
|
||||
|
||||
Step back: WSGI
|
||||
===============
|
||||
|
||||
WSGI has already been mentioned several times, so it has to be something
|
||||
important. In fact it really is, and now it is time to explain it.
|
||||
|
||||
The *Web Server Gateway Interface*, or WSGI for short, is defined in
|
||||
:pep:`333` and is currently the best way to do Python web programming. While
|
||||
it is great for programmers writing frameworks, a normal web developer does not
|
||||
need to get in direct contact with it. When choosing a framework for web
|
||||
development it is a good idea to choose one which supports WSGI.
|
||||
|
||||
The big benefit of WSGI is the unification of the application programming
|
||||
interface. When your program is compatible with WSGI -- which at the outer
|
||||
level means that the framework you are using has support for WSGI -- your
|
||||
program can be deployed via any web server interface for which there are WSGI
|
||||
wrappers. You do not need to care about whether the application user uses
|
||||
mod_python or FastCGI or mod_wsgi -- with WSGI your application will work on
|
||||
any gateway interface. The Python standard library contains its own WSGI
|
||||
server, :mod:`wsgiref`, which is a small web server that can be used for
|
||||
testing.
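As a small illustration of that last point, the sketch below defines a WSGI application and a helper that serves it with ``wsgiref.simple_server``. The names ``hello_app`` and ``serve``, and the default host and port, are assumptions made for this example:

```python
from wsgiref.simple_server import make_server


def hello_app(environ, start_response):
    """Hypothetical WSGI application: report the requested path."""
    path = environ.get("PATH_INFO", "/")
    body = ("You requested %s\n" % path).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def serve(host="localhost", port=8000):
    """Serve hello_app with the wsgiref test server until interrupted."""
    httpd = make_server(host, port, hello_app)
    try:
        httpd.serve_forever()
    finally:
        httpd.server_close()
```

Calling ``serve()`` and opening ``http://localhost:8000/some/path`` in a browser should show the requested path. This server is meant for testing, not for production load.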
|
||||
|
||||
A really great WSGI feature is middleware. Middleware is a layer around your
|
||||
program which can add various functionality to it. There is quite a bit of
|
||||
`middleware <https://wsgi.readthedocs.org/en/latest/libraries.html>`_ already
|
||||
available. For example, instead of writing your own session management (HTTP
|
||||
is a stateless protocol, so to associate multiple HTTP requests with a single
|
||||
user your application must create and manage such state via a session), you can
|
||||
just download middleware which does that, plug it in, and get on with coding
|
||||
the unique parts of your application. The same goes for compression -- there
|
||||
is existing middleware which handles compressing your HTML using gzip to save
|
||||
on your server's bandwidth. Authentication is another problem that is easily
|
||||
solved using existing middleware.
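To make the idea of a "layer around your program" concrete, here is a rough sketch of a compression middleware. It is not one of the existing middleware packages mentioned above: the name ``gzip_middleware`` is hypothetical, the whole response is buffered in memory, and ``exc_info`` is ignored for brevity:

```python
import gzip
import io


def gzip_middleware(app):
    """Wrap *app* so responses are gzip-compressed when the client allows it."""
    def wrapped(environ, start_response):
        # Pass the request through untouched if the client did not ask for gzip.
        if "gzip" not in environ.get("HTTP_ACCEPT_ENCODING", ""):
            return app(environ, start_response)

        captured = {}
        chunks = []

        def capture(status, headers, exc_info=None):
            # Remember status and headers instead of sending them right away.
            captured["status"] = status
            captured["headers"] = list(headers)
            return chunks.append  # stands in for the legacy write() callable

        result = app(environ, capture)
        try:
            chunks.extend(result)
        finally:
            if hasattr(result, "close"):
                result.close()

        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
            gz.write(b"".join(chunks))
        compressed = buf.getvalue()

        # Replace the original Content-Length with the compressed size.
        headers = [(k, v) for k, v in captured["headers"]
                   if k.lower() != "content-length"]
        headers.append(("Content-Encoding", "gzip"))
        headers.append(("Content-Length", str(len(compressed))))
        start_response(captured["status"], headers)
        return [compressed]
    return wrapped
```

Because the wrapper is itself a WSGI application, it can be stacked with other middleware, for example ``application = gzip_middleware(my_app)``, where ``my_app`` is whatever WSGI callable the program already has.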
|
||||
|
||||
Although WSGI may seem complex, the initial phase of learning can be very
|
||||
rewarding because WSGI and the associated middleware already have solutions to
|
||||
many problems that might arise while developing web sites.
|
||||
|
||||
|
||||
WSGI Servers
|
||||
------------
|
||||
|
||||
The code that is used to connect to various low level gateways like CGI or
|
||||
mod_python is called a *WSGI server*. One of these servers is ``flup``, which
|
||||
supports FastCGI and SCGI, as well as `AJP
|
||||
<https://en.wikipedia.org/wiki/Apache_JServ_Protocol>`_. Some of these servers
|
||||
are written in Python, as ``flup`` is, but there also exist others which are
|
||||
written in C and can be used as drop-in replacements.
|
||||
|
||||
There are many servers already available, so a Python web application
|
||||
can be deployed nearly anywhere. This is one big advantage that Python has
|
||||
compared with other web technologies.
|
||||
|
||||
.. seealso::
|
||||
|
||||
A good overview of WSGI-related code can be found in the `WSGI homepage
|
||||
<https://wsgi.readthedocs.org/>`_, which contains an extensive list of `WSGI servers
|
||||
<https://wsgi.readthedocs.org/en/latest/servers.html>`_ which can be used by *any* application
|
||||
supporting WSGI.
|
||||
|
||||
You might be interested in some WSGI-supporting modules already contained in
|
||||
the standard library, namely:
|
||||
|
||||
* :mod:`wsgiref` -- some tiny utilities and servers for WSGI
|
||||
|
||||
|
||||
Case study: MoinMoin
|
||||
--------------------
|
||||
|
||||
What does WSGI give the web application developer? Let's take a look at
|
||||
an application that's been around for a while, which was written in
|
||||
Python without using WSGI.
|
||||
|
||||
One of the most widely used wiki software packages is `MoinMoin
|
||||
<https://moinmo.in/>`_. It was created in 2000, so it predates WSGI by about
|
||||
three years. Older versions needed separate code to run on CGI, mod_python,
|
||||
FastCGI and standalone.
|
||||
|
||||
It now includes support for WSGI. Using WSGI, it is possible to deploy
|
||||
MoinMoin on any WSGI compliant server, with no additional glue code.
|
||||
Unlike the pre-WSGI versions, this could include WSGI servers that the
|
||||
authors of MoinMoin know nothing about.
|
||||
|
||||
|
||||
Model-View-Controller
|
||||
=====================
|
||||
|
||||
The term *MVC* is often encountered in statements such as "framework *foo*
|
||||
supports MVC". MVC is more about the overall organization of code, rather than
|
||||
any particular API. Many web frameworks use this model to help the developer
|
||||
bring structure to their program. Bigger web applications can have lots of
|
||||
code, so it is a good idea to have an effective structure right from the beginning.
|
||||
That way, even users of other frameworks (or even other languages, since MVC is
|
||||
not Python-specific) can easily understand the code, given that they are
|
||||
already familiar with the MVC structure.
|
||||
|
||||
MVC stands for three components:
|
||||
|
||||
* The *model*. This is the data that will be displayed and modified. In
|
||||
Python frameworks, this component is often represented by the classes used by
|
||||
an object-relational mapper.
|
||||
|
||||
* The *view*. This component's job is to display the data of the model to the
|
||||
user. Typically this component is implemented via templates.
|
||||
|
||||
* The *controller*. This is the layer between the user and the model. The
|
||||
controller reacts to user actions (like opening some specific URL), tells
|
||||
the model to modify the data if necessary, and tells the view code what to
|
||||
display.
|
||||
|
||||
While one might think that MVC is a complex design pattern, in fact it is not.
|
||||
It is used in Python because it has turned out to be useful for creating clean,
|
||||
maintainable web sites.
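The three components can be sketched in a few lines without any framework. Everything named here (``GuestBook``, ``render_guestbook``, ``make_controller``, the ``sign`` query parameter) is invented for the illustration; a real framework would supply its own equivalents:

```python
from string import Template

try:                                  # Python 3
    from html import escape
    from urllib.parse import parse_qs
except ImportError:                   # Python 2
    from cgi import escape
    from urlparse import parse_qs


class GuestBook(object):
    """Model: holds the data and the operations that change it."""

    def __init__(self):
        self.names = []

    def sign(self, name):
        self.names.append(name)


PAGE = Template("<html><body><ul>$items</ul></body></html>")


def render_guestbook(names):
    """View: turn model data into HTML; no application logic here."""
    items = "".join("<li>%s</li>" % escape(name) for name in names)
    return PAGE.substitute(items=items)


def make_controller(model):
    """Controller: react to the request, update the model, pick the view."""
    def controller(environ, start_response):
        query = parse_qs(environ.get("QUERY_STRING", ""))
        for name in query.get("sign", []):
            model.sign(name)
        body = render_guestbook(model.names).encode("utf-8")
        start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
        return [body]
    return controller
```

The controller returned by ``make_controller(GuestBook())`` is an ordinary WSGI application. Swapping the template or the storage class touches only one of the three parts.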
|
||||
|
||||
.. note::
|
||||
|
||||
While not all Python frameworks explicitly support MVC, it is often trivial
|
||||
to create a web site which uses the MVC pattern by separating the data logic
|
||||
(the model) from the user interaction logic (the controller) and the
|
||||
templates (the view). That's why it is important not to write unnecessary
|
||||
Python code in the templates -- it works against the MVC model and creates
|
||||
chaos in the code base, making it harder to understand and modify.
|
||||
|
||||
.. seealso::
|
||||
|
||||
The English Wikipedia has an article about the `Model-View-Controller pattern
|
||||
<https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller>`_. It includes a long
|
||||
list of web frameworks for various programming languages.
|
||||
|
||||
|
||||
Ingredients for Websites
|
||||
========================
|
||||
|
||||
Websites are complex constructs, so tools have been created to help web
|
||||
developers make their code easier to write and more maintainable. Tools like
|
||||
these exist for all web frameworks in all languages. Developers are not forced
|
||||
to use these tools, and often there is no "best" tool. It is worth learning
|
||||
about the available tools because they can greatly simplify the process of
|
||||
developing a web site.
|
||||
|
||||
|
||||
.. seealso::
|
||||
|
||||
There are far more components than can be presented here. The Python wiki
|
||||
has a page about these components, called
|
||||
`Web Components <https://wiki.python.org/moin/WebComponents>`_.
|
||||
|
||||
|
||||
Templates
|
||||
---------
|
||||
|
||||
Mixing of HTML and Python code is made possible by a few libraries. While
|
||||
convenient at first, it leads to horribly unmaintainable code. That's why
|
||||
templates exist. Templates are, in the simplest case, just HTML files with
|
||||
placeholders. The HTML is sent to the user's browser after filling in the
|
||||
placeholders.
|
||||
|
||||
Python already includes two ways to build simple templates::
|
||||
|
||||
>>> template = "<html><body><h1>Hello %s!</h1></body></html>"
|
||||
>>> print template % "Reader"
|
||||
<html><body><h1>Hello Reader!</h1></body></html>
|
||||
|
||||
>>> from string import Template
|
||||
>>> template = Template("<html><body><h1>Hello ${name}!</h1></body></html>")
|
||||
>>> print template.substitute(dict(name='Dinsdale'))
|
||||
<html><body><h1>Hello Dinsdale!</h1></body></html>
|
||||
|
||||
To generate complex HTML based on non-trivial model data, conditional
|
||||
and looping constructs like Python's *for* and *if* are generally needed.
|
||||
*Template engines* support templates of this complexity.
|
||||
|
||||
There are a lot of template engines available for Python which can be used with
|
||||
or without a `framework`_. Some of these define a plain-text programming
|
||||
language which is easy to learn, partly because it is limited in scope.
|
||||
Others use XML, and the template output is guaranteed to always be valid
|
||||
XML. There are many other variations.
|
||||
|
||||
Some `frameworks`_ ship their own template engine or recommend one in
|
||||
particular. In the absence of a reason to use a different template engine,
|
||||
using the one provided by or recommended by the framework is a good idea.
|
||||
|
||||
Popular template engines include:
|
||||
|
||||
* `Mako <http://www.makotemplates.org/>`_
|
||||
* `Genshi <http://genshi.edgewall.org/>`_
|
||||
* `Jinja <http://jinja.pocoo.org/>`_
|
||||
|
||||
.. seealso::
|
||||
|
||||
There are many template engines competing for attention, because it is
|
||||
pretty easy to create them in Python. The page `Templating
|
||||
<https://wiki.python.org/moin/Templating>`_ in the wiki lists a big,
|
||||
ever-growing number of these. The three listed above are considered "second
|
||||
generation" template engines and are a good place to start.
|
||||
|
||||
|
||||
Data persistence
|
||||
----------------
|
||||
|
||||
*Data persistence*, while sounding very complicated, is just about storing data.
|
||||
This data might be the text of blog entries, the postings on a bulletin board or
|
||||
the text of a wiki page. There are, of course, a number of different ways to store
|
||||
information on a web server.
|
||||
|
||||
Often, relational database engines like `MySQL <http://www.mysql.com/>`_ or
|
||||
`PostgreSQL <http://www.postgresql.org/>`_ are used because of their good
|
||||
performance when handling very large databases consisting of millions of
|
||||
entries. There is also a small database engine called `SQLite
|
||||
<http://www.sqlite.org/>`_, which is bundled with Python in the :mod:`sqlite3`
|
||||
module, and which uses only one file. It has no other dependencies. For
|
||||
smaller sites SQLite is just enough.
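A short sketch of what using :mod:`sqlite3` looks like follows. The table and column names are made up for the example, and an in-memory database is used so that nothing is written to disk:

```python
import sqlite3

# ":memory:" keeps the database in RAM; pass a file name to persist it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")

# Use "?" placeholders instead of string formatting to avoid SQL injection.
conn.execute("INSERT INTO entries (title, body) VALUES (?, ?)",
             ("First post", "Hello, world"))
conn.commit()

rows = conn.execute("SELECT id, title FROM entries ORDER BY id").fetchall()
```

After this runs, ``rows`` holds one tuple per stored entry, here the id and title of the single row that was inserted.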
|
||||
|
||||
Relational databases are *queried* using a language called `SQL
|
||||
<https://en.wikipedia.org/wiki/SQL>`_. Python programmers in general do not
|
||||
like SQL too much, as they prefer to work with objects. It is possible to save
|
||||
Python objects into a database using a technology called `ORM
|
||||
<https://en.wikipedia.org/wiki/Object-relational_mapping>`_ (Object Relational
|
||||
Mapping). ORM translates all object-oriented access into SQL code under the
|
||||
hood, so the developer does not need to think about it. Most `frameworks`_ use
|
||||
ORMs, and it works quite well.
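To give a feeling for what happens under the hood, the following hand-written class maps objects to rows by generating SQL itself. It is a deliberately naive illustration and not how any particular ORM is implemented; the class ``Entry`` and its ``entry`` table are hypothetical:

```python
import sqlite3


class Entry(object):
    """Hypothetical model class mapped to an ``entry`` table."""

    table = "entry"
    fields = ("title", "body")

    def __init__(self, title, body, id=None):
        self.id, self.title, self.body = id, title, body

    def save(self, conn):
        """Translate this object into an INSERT statement."""
        columns = ", ".join(self.fields)
        marks = ", ".join("?" for _ in self.fields)
        sql = "INSERT INTO %s (%s) VALUES (%s)" % (self.table, columns, marks)
        cursor = conn.execute(sql, [getattr(self, f) for f in self.fields])
        self.id = cursor.lastrowid

    @classmethod
    def all(cls, conn):
        """Translate a query result back into a list of objects."""
        sql = "SELECT %s, id FROM %s ORDER BY id" % (
            ", ".join(cls.fields), cls.table)
        return [cls(*row) for row in conn.execute(sql)]
```

Given a connection whose database contains a matching ``entry`` table, ``Entry("Hi", "text").save(conn)`` stores a row and ``Entry.all(conn)`` returns the rows as ``Entry`` objects. Real ORMs add much more on top, such as schema creation, relations and query building.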
|
||||
|
||||
A second possibility is storing data in normal, plain text files (sometimes
called "flat files"). This is very easy for simple sites,
|
||||
but can be difficult to get right if the web site is performing many
|
||||
updates to the stored data.
|
||||
|
||||
A third possibility is object-oriented databases (also called "object
|
||||
databases"). These databases store the object data in a form that closely
|
||||
parallels the way the objects are structured in memory during program
|
||||
execution. (By contrast, ORMs store the object data as rows of data in tables
|
||||
and relations between those rows.) Storing the objects directly has the
|
||||
advantage that nearly all objects can be saved in a straightforward way, unlike
|
||||
in relational databases where some objects are very hard to represent.
|
||||
|
||||
`Frameworks`_ often give hints on which data storage method to choose. It is
|
||||
usually a good idea to stick to the data store recommended by the framework
|
||||
unless the application has special requirements better satisfied by an
|
||||
alternate storage mechanism.
|
||||
|
||||
.. seealso::
|
||||
|
||||
* `Persistence Tools <https://wiki.python.org/moin/PersistenceTools>`_ lists
|
||||
possibilities on how to save data in the file system. Some of these
|
||||
modules are part of the standard library
|
||||
|
||||
* `Database Programming <https://wiki.python.org/moin/DatabaseProgramming>`_
|
||||
helps with choosing a method for saving data
|
||||
|
||||
* `SQLAlchemy <http://www.sqlalchemy.org/>`_, the most powerful OR-Mapper
|
||||
for Python, and `Elixir <https://pypi.org/project/Elixir>`_, which makes
|
||||
SQLAlchemy easier to use
|
||||
|
||||
* `SQLObject <http://www.sqlobject.org/>`_, another popular OR-Mapper
|
||||
|
||||
* `ZODB <https://launchpad.net/zodb>`_ and `Durus
|
||||
<https://www.mems-exchange.org/software/>`_, two object oriented
|
||||
databases
|
||||
|
||||
|
||||
.. _framework:
|
||||
|
||||
Frameworks
|
||||
==========
|
||||
|
||||
The process of creating code to run web sites involves writing code to provide
|
||||
various services. The code to provide a particular service often works the
|
||||
same way regardless of the complexity or purpose of the web site in question.
|
||||
Abstracting these common solutions into reusable code produces what are called
|
||||
"frameworks" for web development. Perhaps the best-known framework for
|
||||
web development is Ruby on Rails, but Python has its own frameworks. Some of
|
||||
these were partly inspired by Rails, or borrowed ideas from Rails, but many
|
||||
existed a long time before Rails.
|
||||
|
||||
Originally Python web frameworks tended to incorporate all of the services
|
||||
needed to develop web sites as a giant, integrated set of tools. No two web
|
||||
frameworks were interoperable: a program developed for one could not be
|
||||
deployed on a different one without considerable re-engineering work. This led
|
||||
to the development of "minimalist" web frameworks that provided just the tools
|
||||
to communicate between the Python code and the HTTP protocol, with all other
|
||||
services to be added on top via separate components. Some ad hoc standards
|
||||
were developed that allowed for limited interoperability between frameworks,
|
||||
such as a standard that allowed different template engines to be used
|
||||
interchangeably.
|
||||
|
||||
Since the advent of WSGI, the Python web framework world has been evolving
|
||||
toward interoperability based on the WSGI standard. Now many web frameworks,
|
||||
whether "full stack" (providing all the tools one needs to deploy the most
|
||||
complex web sites) or minimalist, or anything in between, are built from
|
||||
collections of reusable components that can be used with more than one
|
||||
framework.
|
||||
|
||||
The majority of users will probably want to select a "full stack" framework
|
||||
that has an active community. These frameworks tend to be well documented,
|
||||
and provide the easiest path to producing a fully functional web site in
|
||||
minimal time.
|
||||
|
||||
|
||||
Some notable frameworks
|
||||
-----------------------
|
||||
|
||||
There are an incredible number of frameworks, so they cannot all be covered
|
||||
here. Instead we will briefly touch on some of the most popular.
|
||||
|
||||
|
||||
Django
|
||||
^^^^^^
|
||||
|
||||
`Django <https://www.djangoproject.com/>`_ is a framework consisting of several
|
||||
tightly coupled elements which were written from scratch and work together very
|
||||
well. It includes an ORM which is quite powerful while being simple to use,
|
||||
and has a great online administration interface which makes it possible to edit
|
||||
the data in the database with a browser. The template engine is text-based and
|
||||
is designed to be usable for page designers who cannot write Python. It
|
||||
supports template inheritance and filters (which work like Unix pipes). Django
|
||||
has many handy features bundled, such as creation of RSS feeds or generic views,
|
||||
which make it possible to create web sites almost without writing any Python code.
|
||||
|
||||
It has a big, international community, the members of which have created many
|
||||
web sites. There are also a lot of add-on projects which extend Django's normal
|
||||
functionality. This is partly due to Django's well written `online
|
||||
documentation <https://docs.djangoproject.com/>`_ and the `Django book
|
||||
<http://www.djangobook.com/>`_.
|
||||
|
||||
|
||||
.. note::
|
||||
|
||||
Although Django is an MVC-style framework, it names the elements
|
||||
differently, which is described in the `Django FAQ
|
||||
<https://docs.djangoproject.com/en/dev/faq/general/#django-appears-to-be-a-mvc-framework-but-you-call-the-controller-the-view-and-the-view-the-template-how-come-you-don-t-use-the-standard-names>`_.
|
||||
|
||||
|
||||
TurboGears
|
||||
^^^^^^^^^^
|
||||
|
||||
Another popular web framework for Python is `TurboGears
|
||||
<http://www.turbogears.org/>`_. TurboGears takes the approach of using already
|
||||
existing components and combining them with glue code to create a seamless
|
||||
experience. TurboGears gives the user flexibility in choosing components. For
|
||||
example the ORM and template engine can be changed to use packages different
|
||||
from those used by default.
|
||||
|
||||
The documentation can be found in the `TurboGears documentation
|
||||
<https://turbogears.readthedocs.org/>`_, where links to screencasts can be found.
|
||||
TurboGears also has an active user community which can respond to most related
|
||||
questions. There is also a `TurboGears book <http://turbogears.org/1.0/docs/TGBooks.html>`_
|
||||
published, which is a good starting point.
|
||||
|
||||
The newest version of TurboGears, version 2.0, moves even further in the direction
|
||||
of WSGI support and a component-based architecture. TurboGears 2 is based on
|
||||
the WSGI stack of another popular component-based web framework, `Pylons
|
||||
<http://www.pylonsproject.org/>`_.
|
||||
|
||||
|
||||
Zope
|
||||
^^^^
|
||||
|
||||
The Zope framework is one of the "old original" frameworks. Its current
|
||||
incarnation in Zope2 is a tightly integrated full-stack framework. One of its
|
||||
most interesting features is its tight integration with a powerful object
|
||||
database called the `ZODB <https://launchpad.net/zodb>`_ (Zope Object Database).
|
||||
Because of its highly integrated nature, Zope wound up in a somewhat isolated
|
||||
ecosystem: code written for Zope wasn't very usable outside of Zope, and
|
||||
vice-versa. To solve this problem the Zope 3 effort was started. Zope 3
|
||||
re-engineers Zope as a set of more cleanly isolated components. This effort
|
||||
was started before the advent of the WSGI standard, but there is WSGI support
|
||||
for Zope 3 from the `Repoze <http://repoze.org/>`_ project. Zope components
|
||||
have many years of production use behind them, and the Zope 3 project gives
|
||||
access to these components to the wider Python community. There is even a
|
||||
separate framework based on the Zope components: `Grok
|
||||
<http://grok.zope.org/>`_.
|
||||
|
||||
Zope is also the infrastructure used by the `Plone <https://plone.org/>`_ content
management system, one of the most powerful and popular content management
systems available.


Other notable frameworks
^^^^^^^^^^^^^^^^^^^^^^^^

Of course, these are not the only frameworks available. There are
many other frameworks worth mentioning.

Another framework that's already been mentioned is `Pylons`_. Pylons is much
like TurboGears, but with an even stronger emphasis on flexibility, which comes
at the cost of being more difficult to use. Nearly every component can be
exchanged, which makes it necessary to use the documentation of every single
component, of which there are many. Pylons builds upon `Paste
<http://pythonpaste.org/>`_, an extensive set of tools which are handy for WSGI.

And that's still not everything. The most up-to-date information can always be
found in the Python wiki.

.. seealso::

   The Python wiki contains an extensive list of `web frameworks
   <https://wiki.python.org/moin/WebFrameworks>`_.

   Most frameworks also have their own mailing lists and IRC channels; look
   for these on the projects' web sites.