Import Upstream version 2.7.18

This commit is contained in:
geos_one
2025-08-15 16:28:06 +02:00
commit ba1f69ab39
4521 changed files with 1778434 additions and 0 deletions

766
Doc/howto/argparse.rst Normal file
View File

@@ -0,0 +1,766 @@
*****************
Argparse Tutorial
*****************
:author: Tshepang Lekhonkhobe
.. _argparse-tutorial:
This tutorial is intended to be a gentle introduction to :mod:`argparse`, the
recommended command-line parsing module in the Python standard library.
This was written for argparse in Python 3. A few details are different in 2.x,
especially some exception messages, which were improved in 3.x.
.. note::
There are two other modules that fulfill the same task, namely
:mod:`getopt` (an equivalent for :c:func:`getopt` from the C
language) and the deprecated :mod:`optparse`.
Note also that :mod:`argparse` is based on :mod:`optparse`,
and therefore very similar in terms of usage.
Concepts
========
Let's show the sort of functionality that we are going to explore in this
introductory tutorial by making use of the :command:`ls` command:
.. code-block:: sh
$ ls
cpython devguide prog.py pypy rm-unused-function.patch
$ ls pypy
ctypes_configure demo dotviewer include lib_pypy lib-python ...
$ ls -l
total 20
drwxr-xr-x 19 wena wena 4096 Feb 18 18:51 cpython
drwxr-xr-x 4 wena wena 4096 Feb 8 12:04 devguide
-rwxr-xr-x 1 wena wena 535 Feb 19 00:05 prog.py
drwxr-xr-x 14 wena wena 4096 Feb 7 00:59 pypy
-rw-r--r-- 1 wena wena 741 Feb 18 01:01 rm-unused-function.patch
$ ls --help
Usage: ls [OPTION]... [FILE]...
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.
...
A few concepts we can learn from the four commands:
* The :command:`ls` command is useful when run without any options at all. It defaults
to displaying the contents of the current directory.
* If we want beyond what it provides by default, we tell it a bit more. In
this case, we want it to display a different directory, ``pypy``.
What we did is specify what is known as a positional argument. It's named so
because the program should know what to do with the value, solely based on
where it appears on the command line. This concept is more relevant
to a command like :command:`cp`, whose most basic usage is ``cp SRC DEST``.
The first position is *what you want copied,* and the second
position is *where you want it copied to*.
* Now, say we want to change behaviour of the program. In our example,
we display more info for each file instead of just showing the file names.
The ``-l`` in that case is known as an optional argument.
* That's a snippet of the help text. It's very useful in that you can
come across a program you have never used before, and can figure out
how it works simply by reading its help text.
The basics
==========
Let us start with a very simple example which does (almost) nothing::
import argparse
parser = argparse.ArgumentParser()
parser.parse_args()
Following is a result of running the code:
.. code-block:: sh
$ python prog.py
$ python prog.py --help
usage: prog.py [-h]
optional arguments:
-h, --help show this help message and exit
$ python prog.py --verbose
usage: prog.py [-h]
prog.py: error: unrecognized arguments: --verbose
$ python prog.py foo
usage: prog.py [-h]
prog.py: error: unrecognized arguments: foo
Here is what is happening:
* Running the script without any options results in nothing displayed to
stdout. Not so useful.
* The second one starts to display the usefulness of the :mod:`argparse`
module. We have done almost nothing, but already we get a nice help message.
* The ``--help`` option, which can also be shortened to ``-h``, is the only
option we get for free (i.e. no need to specify it). Specifying anything
else results in an error. But even then, we do get a useful usage message,
also for free.
Introducing Positional arguments
================================
An example::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("echo")
args = parser.parse_args()
print args.echo
And running the code:
.. code-block:: sh
$ python prog.py
usage: prog.py [-h] echo
prog.py: error: the following arguments are required: echo
$ python prog.py --help
usage: prog.py [-h] echo
positional arguments:
echo
optional arguments:
-h, --help show this help message and exit
$ python prog.py foo
foo
Here is what's happening:
* We've added the :meth:`add_argument` method, which is what we use to specify
which command-line options the program is willing to accept. In this case,
I've named it ``echo`` so that it's in line with its function.
* Calling our program now requires us to specify an option.
* The :meth:`parse_args` method actually returns some data from the
options specified, in this case, ``echo``.
* The variable is some form of 'magic' that :mod:`argparse` performs for free
(i.e. no need to specify which variable that value is stored in).
You will also notice that its name matches the string argument given
to the method, ``echo``.
Note however that, although the help display looks nice and all, it currently
is not as helpful as it can be. For example we see that we got ``echo`` as a
positional argument, but we don't know what it does, other than by guessing or
by reading the source code. So, let's make it a bit more useful::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("echo", help="echo the string you use here")
args = parser.parse_args()
print args.echo
And we get:
.. code-block:: sh
$ python prog.py -h
usage: prog.py [-h] echo
positional arguments:
echo echo the string you use here
optional arguments:
-h, --help show this help message and exit
Now, how about doing something even more useful::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("square", help="display a square of a given number")
args = parser.parse_args()
print args.square**2
Following is a result of running the code:
.. code-block:: sh
$ python prog.py 4
Traceback (most recent call last):
File "prog.py", line 5, in <module>
print args.square**2
TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'int'
That didn't go so well. That's because :mod:`argparse` treats the options we
give it as strings, unless we tell it otherwise. So, let's tell
:mod:`argparse` to treat that input as an integer::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("square", help="display a square of a given number",
type=int)
args = parser.parse_args()
print args.square**2
Following is a result of running the code:
.. code-block:: sh
$ python prog.py 4
16
$ python prog.py four
usage: prog.py [-h] square
prog.py: error: argument square: invalid int value: 'four'
That went well. The program now even helpfully quits on bad illegal input
before proceeding.
Introducing Optional arguments
==============================
So far we have been playing with positional arguments. Let us
have a look on how to add optional ones::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--verbosity", help="increase output verbosity")
args = parser.parse_args()
if args.verbosity:
print "verbosity turned on"
And the output:
.. code-block:: sh
$ python prog.py --verbosity 1
verbosity turned on
$ python prog.py
$ python prog.py --help
usage: prog.py [-h] [--verbosity VERBOSITY]
optional arguments:
-h, --help show this help message and exit
--verbosity VERBOSITY
increase output verbosity
$ python prog.py --verbosity
usage: prog.py [-h] [--verbosity VERBOSITY]
prog.py: error: argument --verbosity: expected one argument
Here is what is happening:
* The program is written so as to display something when ``--verbosity`` is
specified and display nothing when not.
* To show that the option is actually optional, there is no error when running
the program without it. Note that by default, if an optional argument isn't
used, the relevant variable, in this case :attr:`args.verbosity`, is
given ``None`` as a value, which is the reason it fails the truth
test of the :keyword:`if` statement.
* The help message is a bit different.
* When using the ``--verbosity`` option, one must also specify some value,
any value.
The above example accepts arbitrary integer values for ``--verbosity``, but for
our simple program, only two values are actually useful, ``True`` or ``False``.
Let's modify the code accordingly::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--verbose", help="increase output verbosity",
action="store_true")
args = parser.parse_args()
if args.verbose:
print "verbosity turned on"
And the output:
.. code-block:: sh
$ python prog.py --verbose
verbosity turned on
$ python prog.py --verbose 1
usage: prog.py [-h] [--verbose]
prog.py: error: unrecognized arguments: 1
$ python prog.py --help
usage: prog.py [-h] [--verbose]
optional arguments:
-h, --help show this help message and exit
--verbose increase output verbosity
Here is what is happening:
* The option is now more of a flag than something that requires a value.
We even changed the name of the option to match that idea.
Note that we now specify a new keyword, ``action``, and give it the value
``"store_true"``. This means that, if the option is specified,
assign the value ``True`` to :data:`args.verbose`.
Not specifying it implies ``False``.
* It complains when you specify a value, in true spirit of what flags
actually are.
* Notice the different help text.
Short options
-------------
If you are familiar with command line usage,
you will notice that I haven't yet touched on the topic of short
versions of the options. It's quite simple::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("-v", "--verbose", help="increase output verbosity",
action="store_true")
args = parser.parse_args()
if args.verbose:
print "verbosity turned on"
And here goes:
.. code-block:: sh
$ python prog.py -v
verbosity turned on
$ python prog.py --help
usage: prog.py [-h] [-v]
optional arguments:
-h, --help show this help message and exit
-v, --verbose increase output verbosity
Note that the new ability is also reflected in the help text.
Combining Positional and Optional arguments
===========================================
Our program keeps growing in complexity::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("square", type=int,
help="display a square of a given number")
parser.add_argument("-v", "--verbose", action="store_true",
help="increase output verbosity")
args = parser.parse_args()
answer = args.square**2
if args.verbose:
print "the square of {} equals {}".format(args.square, answer)
else:
print answer
And now the output:
.. code-block:: sh
$ python prog.py
usage: prog.py [-h] [-v] square
prog.py: error: the following arguments are required: square
$ python prog.py 4
16
$ python prog.py 4 --verbose
the square of 4 equals 16
$ python prog.py --verbose 4
the square of 4 equals 16
* We've brought back a positional argument, hence the complaint.
* Note that the order does not matter.
How about we give this program of ours back the ability to have
multiple verbosity values, and actually get to use them::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("square", type=int,
help="display a square of a given number")
parser.add_argument("-v", "--verbosity", type=int,
help="increase output verbosity")
args = parser.parse_args()
answer = args.square**2
if args.verbosity == 2:
print "the square of {} equals {}".format(args.square, answer)
elif args.verbosity == 1:
print "{}^2 == {}".format(args.square, answer)
else:
print answer
And the output:
.. code-block:: sh
$ python prog.py 4
16
$ python prog.py 4 -v
usage: prog.py [-h] [-v VERBOSITY] square
prog.py: error: argument -v/--verbosity: expected one argument
$ python prog.py 4 -v 1
4^2 == 16
$ python prog.py 4 -v 2
the square of 4 equals 16
$ python prog.py 4 -v 3
16
These all look good except the last one, which exposes a bug in our program.
Let's fix it by restricting the values the ``--verbosity`` option can accept::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("square", type=int,
help="display a square of a given number")
parser.add_argument("-v", "--verbosity", type=int, choices=[0, 1, 2],
help="increase output verbosity")
args = parser.parse_args()
answer = args.square**2
if args.verbosity == 2:
print "the square of {} equals {}".format(args.square, answer)
elif args.verbosity == 1:
print "{}^2 == {}".format(args.square, answer)
else:
print answer
And the output:
.. code-block:: sh
$ python prog.py 4 -v 3
usage: prog.py [-h] [-v {0,1,2}] square
prog.py: error: argument -v/--verbosity: invalid choice: 3 (choose from 0, 1, 2)
$ python prog.py 4 -h
usage: prog.py [-h] [-v {0,1,2}] square
positional arguments:
square display a square of a given number
optional arguments:
-h, --help show this help message and exit
-v {0,1,2}, --verbosity {0,1,2}
increase output verbosity
Note that the change also reflects both in the error message as well as the
help string.
Now, let's use a different approach of playing with verbosity, which is pretty
common. It also matches the way the CPython executable handles its own
verbosity argument (check the output of ``python --help``)::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("square", type=int,
help="display the square of a given number")
parser.add_argument("-v", "--verbosity", action="count",
help="increase output verbosity")
args = parser.parse_args()
answer = args.square**2
if args.verbosity == 2:
print "the square of {} equals {}".format(args.square, answer)
elif args.verbosity == 1:
print "{}^2 == {}".format(args.square, answer)
else:
print answer
We have introduced another action, "count",
to count the number of occurrences of a specific optional arguments:
.. code-block:: sh
$ python prog.py 4
16
$ python prog.py 4 -v
4^2 == 16
$ python prog.py 4 -vv
the square of 4 equals 16
$ python prog.py 4 --verbosity --verbosity
the square of 4 equals 16
$ python prog.py 4 -v 1
usage: prog.py [-h] [-v] square
prog.py: error: unrecognized arguments: 1
$ python prog.py 4 -h
usage: prog.py [-h] [-v] square
positional arguments:
square display a square of a given number
optional arguments:
-h, --help show this help message and exit
-v, --verbosity increase output verbosity
$ python prog.py 4 -vvv
16
* Yes, it's now more of a flag (similar to ``action="store_true"``) in the
previous version of our script. That should explain the complaint.
* It also behaves similar to "store_true" action.
* Now here's a demonstration of what the "count" action gives. You've probably
seen this sort of usage before.
* And, just like the "store_true" action, if you don't specify the ``-v`` flag,
that flag is considered to have ``None`` value.
* As should be expected, specifying the long form of the flag, we should get
the same output.
* Sadly, our help output isn't very informative on the new ability our script
has acquired, but that can always be fixed by improving the documentation for
our script (e.g. via the ``help`` keyword argument).
* That last output exposes a bug in our program.
Let's fix::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("square", type=int,
help="display a square of a given number")
parser.add_argument("-v", "--verbosity", action="count",
help="increase output verbosity")
args = parser.parse_args()
answer = args.square**2
# bugfix: replace == with >=
if args.verbosity >= 2:
print "the square of {} equals {}".format(args.square, answer)
elif args.verbosity >= 1:
print "{}^2 == {}".format(args.square, answer)
else:
print answer
And this is what it gives:
.. code-block:: sh
$ python prog.py 4 -vvv
the square of 4 equals 16
$ python prog.py 4 -vvvv
the square of 4 equals 16
$ python prog.py 4
Traceback (most recent call last):
File "prog.py", line 11, in <module>
if args.verbosity >= 2:
TypeError: unorderable types: NoneType() >= int()
* First output went well, and fixes the bug we had before.
That is, we want any value >= 2 to be as verbose as possible.
* Third output not so good.
Let's fix that bug::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("square", type=int,
help="display a square of a given number")
parser.add_argument("-v", "--verbosity", action="count", default=0,
help="increase output verbosity")
args = parser.parse_args()
answer = args.square**2
if args.verbosity >= 2:
print "the square of {} equals {}".format(args.square, answer)
elif args.verbosity >= 1:
print "{}^2 == {}".format(args.square, answer)
else:
print answer
We've just introduced yet another keyword, ``default``.
We've set it to ``0`` in order to make it comparable to the other int values.
Remember that by default,
if an optional argument isn't specified,
it gets the ``None`` value, and that cannot be compared to an int value
(hence the :exc:`TypeError` exception).
And:
.. code-block:: sh
$ python prog.py 4
16
You can go quite far just with what we've learned so far,
and we have only scratched the surface.
The :mod:`argparse` module is very powerful,
and we'll explore a bit more of it before we end this tutorial.
Getting a little more advanced
==============================
What if we wanted to expand our tiny program to perform other powers,
not just squares::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("x", type=int, help="the base")
parser.add_argument("y", type=int, help="the exponent")
parser.add_argument("-v", "--verbosity", action="count", default=0)
args = parser.parse_args()
answer = args.x**args.y
if args.verbosity >= 2:
print "{} to the power {} equals {}".format(args.x, args.y, answer)
elif args.verbosity >= 1:
print "{}^{} == {}".format(args.x, args.y, answer)
else:
print answer
Output:
.. code-block:: sh
$ python prog.py
usage: prog.py [-h] [-v] x y
prog.py: error: the following arguments are required: x, y
$ python prog.py -h
usage: prog.py [-h] [-v] x y
positional arguments:
x the base
y the exponent
optional arguments:
-h, --help show this help message and exit
-v, --verbosity
$ python prog.py 4 2 -v
4^2 == 16
Notice that so far we've been using verbosity level to *change* the text
that gets displayed. The following example instead uses verbosity level
to display *more* text instead::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("x", type=int, help="the base")
parser.add_argument("y", type=int, help="the exponent")
parser.add_argument("-v", "--verbosity", action="count", default=0)
args = parser.parse_args()
answer = args.x**args.y
if args.verbosity >= 2:
print "Running '{}'".format(__file__)
if args.verbosity >= 1:
print "{}^{} ==".format(args.x, args.y),
print answer
Output:
.. code-block:: sh
$ python prog.py 4 2
16
$ python prog.py 4 2 -v
4^2 == 16
$ python prog.py 4 2 -vv
Running 'prog.py'
4^2 == 16
Conflicting options
-------------------
So far, we have been working with two methods of an
:class:`argparse.ArgumentParser` instance. Let's introduce a third one,
:meth:`add_mutually_exclusive_group`. It allows for us to specify options that
conflict with each other. Let's also change the rest of the program so that
the new functionality makes more sense:
we'll introduce the ``--quiet`` option,
which will be the opposite of the ``--verbose`` one::
import argparse
parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
group.add_argument("-v", "--verbose", action="store_true")
group.add_argument("-q", "--quiet", action="store_true")
parser.add_argument("x", type=int, help="the base")
parser.add_argument("y", type=int, help="the exponent")
args = parser.parse_args()
answer = args.x**args.y
if args.quiet:
print answer
elif args.verbose:
print "{} to the power {} equals {}".format(args.x, args.y, answer)
else:
print "{}^{} == {}".format(args.x, args.y, answer)
Our program is now simpler, and we've lost some functionality for the sake of
demonstration. Anyways, here's the output:
.. code-block:: sh
$ python prog.py 4 2
4^2 == 16
$ python prog.py 4 2 -q
16
$ python prog.py 4 2 -v
4 to the power 2 equals 16
$ python prog.py 4 2 -vq
usage: prog.py [-h] [-v | -q] x y
prog.py: error: argument -q/--quiet: not allowed with argument -v/--verbose
$ python prog.py 4 2 -v --quiet
usage: prog.py [-h] [-v | -q] x y
prog.py: error: argument -q/--quiet: not allowed with argument -v/--verbose
That should be easy to follow. I've added that last output so you can see the
sort of flexibility you get, i.e. mixing long form options with short form
ones.
Before we conclude, you probably want to tell your users the main purpose of
your program, just in case they don't know::
import argparse
parser = argparse.ArgumentParser(description="calculate X to the power of Y")
group = parser.add_mutually_exclusive_group()
group.add_argument("-v", "--verbose", action="store_true")
group.add_argument("-q", "--quiet", action="store_true")
parser.add_argument("x", type=int, help="the base")
parser.add_argument("y", type=int, help="the exponent")
args = parser.parse_args()
answer = args.x**args.y
if args.quiet:
print answer
elif args.verbose:
print "{} to the power {} equals {}".format(args.x, args.y, answer)
else:
print "{}^{} == {}".format(args.x, args.y, answer)
Note that slight difference in the usage text. Note the ``[-v | -q]``,
which tells us that we can either use ``-v`` or ``-q``,
but not both at the same time:
.. code-block:: sh
$ python prog.py --help
usage: prog.py [-h] [-v | -q] x y
calculate X to the power of Y
positional arguments:
x the base
y the exponent
optional arguments:
-h, --help show this help message and exit
-v, --verbose
-q, --quiet
Conclusion
==========
The :mod:`argparse` module offers a lot more than shown here.
Its docs are quite detailed and thorough, and full of examples.
Having gone through this tutorial, you should easily digest them
without feeling overwhelmed.

257
Doc/howto/cporting.rst Normal file
View File

@@ -0,0 +1,257 @@
.. highlightlang:: c
.. _cporting-howto:
*************************************
Porting Extension Modules to Python 3
*************************************
:author: Benjamin Peterson
.. topic:: Abstract
Although changing the C-API was not one of Python 3's objectives,
the many Python-level changes made leaving Python 2's API intact
impossible. In fact, some changes such as :func:`int` and
:func:`long` unification are more obvious on the C level. This
document endeavors to document incompatibilities and how they can
be worked around.
Conditional compilation
=======================
The easiest way to compile only some code for Python 3 is to check
if :c:macro:`PY_MAJOR_VERSION` is greater than or equal to 3. ::
#if PY_MAJOR_VERSION >= 3
#define IS_PY3K
#endif
API functions that are not present can be aliased to their equivalents within
conditional blocks.
Changes to Object APIs
======================
Python 3 merged together some types with similar functions while cleanly
separating others.
str/unicode Unification
-----------------------
Python 3's :func:`str` type is equivalent to Python 2's :func:`unicode`; the C
functions are called ``PyUnicode_*`` for both. The old 8-bit string type has become
:func:`bytes`, with C functions called ``PyBytes_*``. Python 2.6 and later provide a compatibility header,
:file:`bytesobject.h`, mapping ``PyBytes`` names to ``PyString`` ones. For best
compatibility with Python 3, :c:type:`PyUnicode` should be used for textual data and
:c:type:`PyBytes` for binary data. It's also important to remember that
:c:type:`PyBytes` and :c:type:`PyUnicode` in Python 3 are not interchangeable like
:c:type:`PyString` and :c:type:`PyUnicode` are in Python 2. The following example
shows best practices with regards to :c:type:`PyUnicode`, :c:type:`PyString`,
and :c:type:`PyBytes`. ::
#include "stdlib.h"
#include "Python.h"
#include "bytesobject.h"
/* text example */
static PyObject *
say_hello(PyObject *self, PyObject *args) {
PyObject *name, *result;
if (!PyArg_ParseTuple(args, "U:say_hello", &name))
return NULL;
result = PyUnicode_FromFormat("Hello, %S!", name);
return result;
}
/* just a forward */
static char * do_encode(PyObject *);
/* bytes example */
static PyObject *
encode_object(PyObject *self, PyObject *args) {
char *encoded;
PyObject *result, *myobj;
if (!PyArg_ParseTuple(args, "O:encode_object", &myobj))
return NULL;
encoded = do_encode(myobj);
if (encoded == NULL)
return NULL;
result = PyBytes_FromString(encoded);
free(encoded);
return result;
}
long/int Unification
--------------------
Python 3 has only one integer type, :func:`int`. But it actually
corresponds to Python 2's :func:`long` type—the :func:`int` type
used in Python 2 was removed. In the C-API, ``PyInt_*`` functions
are replaced by their ``PyLong_*`` equivalents.
Module initialization and state
===============================
Python 3 has a revamped extension module initialization system. (See
:pep:`3121`.) Instead of storing module state in globals, they should
be stored in an interpreter specific structure. Creating modules that
act correctly in both Python 2 and Python 3 is tricky. The following
simple example demonstrates how. ::
#include "Python.h"
struct module_state {
PyObject *error;
};
#if PY_MAJOR_VERSION >= 3
#define GETSTATE(m) ((struct module_state*)PyModule_GetState(m))
#else
#define GETSTATE(m) (&_state)
static struct module_state _state;
#endif
static PyObject *
error_out(PyObject *m) {
struct module_state *st = GETSTATE(m);
PyErr_SetString(st->error, "something bad happened");
return NULL;
}
static PyMethodDef myextension_methods[] = {
{"error_out", (PyCFunction)error_out, METH_NOARGS, NULL},
{NULL, NULL}
};
#if PY_MAJOR_VERSION >= 3
static int myextension_traverse(PyObject *m, visitproc visit, void *arg) {
Py_VISIT(GETSTATE(m)->error);
return 0;
}
static int myextension_clear(PyObject *m) {
Py_CLEAR(GETSTATE(m)->error);
return 0;
}
static struct PyModuleDef moduledef = {
PyModuleDef_HEAD_INIT,
"myextension",
NULL,
sizeof(struct module_state),
myextension_methods,
NULL,
myextension_traverse,
myextension_clear,
NULL
};
#define INITERROR return NULL
PyMODINIT_FUNC
PyInit_myextension(void)
#else
#define INITERROR return
void
initmyextension(void)
#endif
{
#if PY_MAJOR_VERSION >= 3
PyObject *module = PyModule_Create(&moduledef);
#else
PyObject *module = Py_InitModule("myextension", myextension_methods);
#endif
if (module == NULL)
INITERROR;
struct module_state *st = GETSTATE(module);
st->error = PyErr_NewException("myextension.Error", NULL, NULL);
if (st->error == NULL) {
Py_DECREF(module);
INITERROR;
}
#if PY_MAJOR_VERSION >= 3
return module;
#endif
}
CObject replaced with Capsule
=============================
The :c:type:`Capsule` object was introduced in Python 3.1 and 2.7 to replace
:c:type:`CObject`. CObjects were useful,
but the :c:type:`CObject` API was problematic: it didn't permit distinguishing
between valid CObjects, which allowed mismatched CObjects to crash the
interpreter, and some of its APIs relied on undefined behavior in C.
(For further reading on the rationale behind Capsules, please see :issue:`5630`.)
If you're currently using CObjects, and you want to migrate to 3.1 or newer,
you'll need to switch to Capsules.
:c:type:`CObject` was deprecated in 3.1 and 2.7 and completely removed in
Python 3.2. If you only support 2.7, or 3.1 and above, you
can simply switch to :c:type:`Capsule`. If you need to support Python 3.0,
or versions of Python earlier than 2.7,
you'll have to support both CObjects and Capsules.
(Note that Python 3.0 is no longer supported, and it is not recommended
for production use.)
The following example header file :file:`capsulethunk.h` may
solve the problem for you. Simply write your code against the
:c:type:`Capsule` API and include this header file after
:file:`Python.h`. Your code will automatically use Capsules
in versions of Python with Capsules, and switch to CObjects
when Capsules are unavailable.
:file:`capsulethunk.h` simulates Capsules using CObjects. However,
:c:type:`CObject` provides no place to store the capsule's "name". As a
result the simulated :c:type:`Capsule` objects created by :file:`capsulethunk.h`
behave slightly differently from real Capsules. Specifically:
* The name parameter passed in to :c:func:`PyCapsule_New` is ignored.
* The name parameter passed in to :c:func:`PyCapsule_IsValid` and
:c:func:`PyCapsule_GetPointer` is ignored, and no error checking
of the name is performed.
* :c:func:`PyCapsule_GetName` always returns NULL.
* :c:func:`PyCapsule_SetName` always raises an exception and
returns failure. (Since there's no way to store a name
in a CObject, noisy failure of :c:func:`PyCapsule_SetName`
was deemed preferable to silent failure here. If this is
inconvenient, feel free to modify your local
copy as you see fit.)
You can find :file:`capsulethunk.h` in the Python source distribution
as :source:`Doc/includes/capsulethunk.h`. We also include it here for
your convenience:
.. literalinclude:: ../includes/capsulethunk.h
Other options
=============
If you are writing a new extension module, you might consider `Cython
<http://cython.org/>`_. It translates a Python-like language to C. The
extension modules it creates are compatible with Python 3 and Python 2.

439
Doc/howto/curses.rst Normal file
View File

@@ -0,0 +1,439 @@
.. _curses-howto:
**********************************
Curses Programming with Python
**********************************
:Author: A.M. Kuchling, Eric S. Raymond
:Release: 2.03
.. topic:: Abstract
This document describes how to write text-mode programs with Python 2.x, using
the :mod:`curses` extension module to control the display.
What is curses?
===============
The curses library supplies a terminal-independent screen-painting and
keyboard-handling facility for text-based terminals; such terminals include
VT100s, the Linux console, and the simulated terminal provided by X11 programs
such as xterm and rxvt. Display terminals support various control codes to
perform common operations such as moving the cursor, scrolling the screen, and
erasing areas. Different terminals use widely differing codes, and often have
their own minor quirks.
In a world of X displays, one might ask "why bother"? It's true that
character-cell display terminals are an obsolete technology, but there are
niches in which being able to do fancy things with them are still valuable. One
is on small-footprint or embedded Unixes that don't carry an X server. Another
is for tools like OS installers and kernel configurators that may have to run
before X is available.
The curses library hides all the details of different terminals, and provides
the programmer with an abstraction of a display, containing multiple
non-overlapping windows. The contents of a window can be changed in various
ways---adding text, erasing it, changing its appearance---and the curses library
will automagically figure out what control codes need to be sent to the terminal
to produce the right output.
The curses library was originally written for BSD Unix; the later System V
versions of Unix from AT&T added many enhancements and new functions. BSD curses
is no longer maintained, having been replaced by ncurses, which is an
open-source implementation of the AT&T interface. If you're using an
open-source Unix such as Linux or FreeBSD, your system almost certainly uses
ncurses. Since most current commercial Unix versions are based on System V
code, all the functions described here will probably be available. The older
versions of curses carried by some proprietary Unixes may not support
everything, though.
No one has made a Windows port of the curses module. On a Windows platform, try
the Console module written by Fredrik Lundh. The Console module provides
cursor-addressable text output, plus full support for mouse and keyboard input,
and is available from http://effbot.org/zone/console-index.htm.
The Python curses module
------------------------
Thy Python module is a fairly simple wrapper over the C functions provided by
curses; if you're already familiar with curses programming in C, it's really
easy to transfer that knowledge to Python. The biggest difference is that the
Python interface makes things simpler, by merging different C functions such as
:func:`addstr`, :func:`mvaddstr`, :func:`mvwaddstr`, into a single
:meth:`addstr` method. You'll see this covered in more detail later.
This HOWTO is simply an introduction to writing text-mode programs with curses
and Python. It doesn't attempt to be a complete guide to the curses API; for
that, see the Python library guide's section on ncurses, and the C manual pages
for ncurses. It will, however, give you the basic ideas.
Starting and ending a curses application
========================================
Before doing anything, curses must be initialized. This is done by calling the
:func:`initscr` function, which will determine the terminal type, send any
required setup codes to the terminal, and create various internal data
structures. If successful, :func:`initscr` returns a window object representing
the entire screen; this is usually called ``stdscr``, after the name of the
corresponding C variable. ::
import curses
stdscr = curses.initscr()
Usually curses applications turn off automatic echoing of keys to the screen, in
order to be able to read keys and only display them under certain circumstances.
This requires calling the :func:`noecho` function. ::
curses.noecho()
Applications will also commonly need to react to keys instantly, without
requiring the Enter key to be pressed; this is called cbreak mode, as opposed to
the usual buffered input mode. ::
curses.cbreak()
Terminals usually return special keys, such as the cursor keys or navigation
keys such as Page Up and Home, as a multibyte escape sequence. While you could
write your application to expect such sequences and process them accordingly,
curses can do it for you, returning a special value such as
:const:`curses.KEY_LEFT`. To get curses to do the job, you'll have to enable
keypad mode. ::
stdscr.keypad(1)
Terminating a curses application is much easier than starting one. You'll need
to call ::
curses.nocbreak(); stdscr.keypad(0); curses.echo()
to reverse the curses-friendly terminal settings. Then call the :func:`endwin`
function to restore the terminal to its original operating mode. ::
curses.endwin()
A common problem when debugging a curses application is to get your terminal
messed up when the application dies without restoring the terminal to its
previous state. In Python this commonly happens when your code is buggy and
raises an uncaught exception. Keys are no longer echoed to the screen when
you type them, for example, which makes using the shell difficult.
In Python you can avoid these complications and make debugging much easier by
importing the :func:`curses.wrapper` function. It takes a callable and does
the initializations described above, also initializing colors if color support
is present. It then runs your provided callable and finally deinitializes
appropriately. The callable is called inside a try-catch clause which catches
exceptions, performs curses deinitialization, and then passes the exception
upwards. Thus, your terminal won't be left in a funny state on exception.
Windows and Pads
================
Windows are the basic abstraction in curses. A window object represents a
rectangular area of the screen, and supports various methods to display text,
erase it, allow the user to input strings, and so forth.
The ``stdscr`` object returned by the :func:`initscr` function is a window
object that covers the entire screen. Many programs may need only this single
window, but you might wish to divide the screen into smaller windows, in order
to redraw or clear them separately. The :func:`newwin` function creates a new
window of a given size, returning the new window object. ::
begin_x = 20; begin_y = 7
height = 5; width = 40
win = curses.newwin(height, width, begin_y, begin_x)
A word about the coordinate system used in curses: coordinates are always passed
in the order *y,x*, and the top-left corner of a window is coordinate (0,0).
This breaks a common convention for handling coordinates, where the *x*
coordinate usually comes first. This is an unfortunate difference from most
other computer applications, but it's been part of curses since it was first
written, and it's too late to change things now.
When you call a method to display or erase text, the effect doesn't immediately
show up on the display. This is because curses was originally written with slow
300-baud terminal connections in mind; with these terminals, minimizing the time
required to redraw the screen is very important. This lets curses accumulate
changes to the screen, and display them in the most efficient manner. For
example, if your program displays some characters in a window, and then clears
the window, there's no need to send the original characters because they'd never
be visible.
Accordingly, curses requires that you explicitly tell it to redraw windows,
using the :func:`refresh` method of window objects. In practice, this doesn't
really complicate programming with curses much. Most programs go into a flurry
of activity, and then pause waiting for a keypress or some other action on the
part of the user. All you have to do is to be sure that the screen has been
redrawn before pausing to wait for user input, by simply calling
``stdscr.refresh()`` or the :func:`refresh` method of some other relevant
window.
A pad is a special case of a window; it can be larger than the actual display
screen, and only a portion of it displayed at a time. Creating a pad simply
requires the pad's height and width, while refreshing a pad requires giving the
coordinates of the on-screen area where a subsection of the pad will be
displayed. ::
pad = curses.newpad(100, 100)
# These loops fill the pad with letters; this is
# explained in the next section
for y in range(0, 100):
for x in range(0, 100):
try:
pad.addch(y,x, ord('a') + (x*x+y*y) % 26)
except curses.error:
pass
# Displays a section of the pad in the middle of the screen
pad.refresh(0,0, 5,5, 20,75)
The :func:`refresh` call displays a section of the pad in the rectangle
extending from coordinate (5,5) to coordinate (20,75) on the screen; the upper
left corner of the displayed section is coordinate (0,0) on the pad. Beyond
that difference, pads are exactly like ordinary windows and support the same
methods.
If you have multiple windows and pads on screen there is a more efficient way to
go, which will prevent annoying screen flicker at refresh time. Use the
:meth:`noutrefresh` method of each window to update the data structure
representing the desired state of the screen; then change the physical screen to
match the desired state in one go with the function :func:`doupdate`. The
normal :meth:`refresh` method calls :func:`doupdate` as its last act.
Displaying Text
===============
From a C programmer's point of view, curses may sometimes look like a twisty
maze of functions, all subtly different. For example, :func:`addstr` displays a
string at the current cursor location in the ``stdscr`` window, while
:func:`mvaddstr` moves to a given y,x coordinate first before displaying the
string. :func:`waddstr` is just like :func:`addstr`, but allows specifying a
window to use, instead of using ``stdscr`` by default. :func:`mvwaddstr` follows
similarly.
Fortunately the Python interface hides all these details; ``stdscr`` is a window
object like any other, and methods like :func:`addstr` accept multiple argument
forms. Usually there are four different forms.
+---------------------------------+-----------------------------------------------+
| Form | Description |
+=================================+===============================================+
| *str* or *ch* | Display the string *str* or character *ch* at |
| | the current position |
+---------------------------------+-----------------------------------------------+
| *str* or *ch*, *attr* | Display the string *str* or character *ch*, |
| | using attribute *attr* at the current |
| | position |
+---------------------------------+-----------------------------------------------+
| *y*, *x*, *str* or *ch* | Move to position *y,x* within the window, and |
| | display *str* or *ch* |
+---------------------------------+-----------------------------------------------+
| *y*, *x*, *str* or *ch*, *attr* | Move to position *y,x* within the window, and |
| | display *str* or *ch*, using attribute *attr* |
+---------------------------------+-----------------------------------------------+
Attributes allow displaying text in highlighted forms, such as in boldface,
underline, reverse code, or in color. They'll be explained in more detail in
the next subsection.
The :func:`addstr` function takes a Python string as the value to be displayed,
while the :func:`addch` functions take a character, which can be either a Python
string of length 1 or an integer. If it's a string, you're limited to
displaying characters between 0 and 255. SVr4 curses provides constants for
extension characters; these constants are integers greater than 255. For
example, :const:`ACS_PLMINUS` is a +/- symbol, and :const:`ACS_ULCORNER` is the
upper left corner of a box (handy for drawing borders).
Windows remember where the cursor was left after the last operation, so if you
leave out the *y,x* coordinates, the string or character will be displayed
wherever the last operation left off. You can also move the cursor with the
``move(y,x)`` method. Because some terminals always display a flashing cursor,
you may want to ensure that the cursor is positioned in some location where it
won't be distracting; it can be confusing to have the cursor blinking at some
apparently random location.
If your application doesn't need a blinking cursor at all, you can call
``curs_set(0)`` to make it invisible. Equivalently, and for compatibility with
older curses versions, there's a ``leaveok(bool)`` function. When *bool* is
true, the curses library will attempt to suppress the flashing cursor, and you
won't need to worry about leaving it in odd locations.
Attributes and Color
--------------------
Characters can be displayed in different ways. Status lines in a text-based
application are commonly shown in reverse video; a text viewer may need to
highlight certain words. curses supports this by allowing you to specify an
attribute for each cell on the screen.
An attribute is an integer, each bit representing a different attribute. You can
try to display text with multiple attribute bits set, but curses doesn't
guarantee that all the possible combinations are available, or that they're all
visually distinct. That depends on the ability of the terminal being used, so
it's safest to stick to the most commonly available attributes, listed here.
+----------------------+--------------------------------------+
| Attribute | Description |
+======================+======================================+
| :const:`A_BLINK` | Blinking text |
+----------------------+--------------------------------------+
| :const:`A_BOLD` | Extra bright or bold text |
+----------------------+--------------------------------------+
| :const:`A_DIM` | Half bright text |
+----------------------+--------------------------------------+
| :const:`A_REVERSE` | Reverse-video text |
+----------------------+--------------------------------------+
| :const:`A_STANDOUT` | The best highlighting mode available |
+----------------------+--------------------------------------+
| :const:`A_UNDERLINE` | Underlined text |
+----------------------+--------------------------------------+
So, to display a reverse-video status line on the top line of the screen, you
could code::
stdscr.addstr(0, 0, "Current mode: Typing mode",
curses.A_REVERSE)
stdscr.refresh()
The curses library also supports color on those terminals that provide it. The
most common such terminal is probably the Linux console, followed by color
xterms.
To use color, you must call the :func:`start_color` function soon after calling
:func:`initscr`, to initialize the default color set (the
:func:`curses.wrapper.wrapper` function does this automatically). Once that's
done, the :func:`has_colors` function returns TRUE if the terminal in use can
actually display color. (Note: curses uses the American spelling 'color',
instead of the Canadian/British spelling 'colour'. If you're used to the
British spelling, you'll have to resign yourself to misspelling it for the sake
of these functions.)
The curses library maintains a finite number of color pairs, containing a
foreground (or text) color and a background color. You can get the attribute
value corresponding to a color pair with the :func:`color_pair` function; this
can be bitwise-OR'ed with other attributes such as :const:`A_REVERSE`, but
again, such combinations are not guaranteed to work on all terminals.
An example, which displays a line of text using color pair 1::
stdscr.addstr("Pretty text", curses.color_pair(1))
stdscr.refresh()
As I said before, a color pair consists of a foreground and background color.
:func:`start_color` initializes 8 basic colors when it activates color mode.
They are: 0:black, 1:red, 2:green, 3:yellow, 4:blue, 5:magenta, 6:cyan, and
7:white. The curses module defines named constants for each of these colors:
:const:`curses.COLOR_BLACK`, :const:`curses.COLOR_RED`, and so forth.
The ``init_pair(n, f, b)`` function changes the definition of color pair *n*, to
foreground color f and background color b. Color pair 0 is hard-wired to white
on black, and cannot be changed.
Let's put all this together. To change color 1 to red text on a white
background, you would call::
curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
When you change a color pair, any text already displayed using that color pair
will change to the new colors. You can also display new text in this color
with::
stdscr.addstr(0,0, "RED ALERT!", curses.color_pair(1))
Very fancy terminals can change the definitions of the actual colors to a given
RGB value. This lets you change color 1, which is usually red, to purple or
blue or any other color you like. Unfortunately, the Linux console doesn't
support this, so I'm unable to try it out, and can't provide any examples. You
can check if your terminal can do this by calling :func:`can_change_color`,
which returns TRUE if the capability is there. If you're lucky enough to have
such a talented terminal, consult your system's man pages for more information.
User Input
==========
The curses library itself offers only very simple input mechanisms. Python's
support adds a text-input widget that makes up some of the lack.
The most common way to get input to a window is to use its :meth:`getch` method.
:meth:`getch` pauses and waits for the user to hit a key, displaying it if
:func:`echo` has been called earlier. You can optionally specify a coordinate
to which the cursor should be moved before pausing.
It's possible to change this behavior with the method :meth:`nodelay`. After
``nodelay(1)``, :meth:`getch` for the window becomes non-blocking and returns
``curses.ERR`` (a value of -1) when no input is ready. There's also a
:func:`halfdelay` function, which can be used to (in effect) set a timer on each
:meth:`getch`; if no input becomes available within a specified
delay (measured in tenths of a second), curses raises an exception.
The :meth:`getch` method returns an integer; if it's between 0 and 255, it
represents the ASCII code of the key pressed. Values greater than 255 are
special keys such as Page Up, Home, or the cursor keys. You can compare the
value returned to constants such as :const:`curses.KEY_PPAGE`,
:const:`curses.KEY_HOME`, or :const:`curses.KEY_LEFT`. Usually the main loop of
your program will look something like this::
while 1:
c = stdscr.getch()
if c == ord('p'):
PrintDocument()
elif c == ord('q'):
break # Exit the while()
elif c == curses.KEY_HOME:
x = y = 0
The :mod:`curses.ascii` module supplies ASCII class membership functions that
take either integer or 1-character-string arguments; these may be useful in
writing more readable tests for your command interpreters. It also supplies
conversion functions that take either integer or 1-character-string arguments
and return the same type. For example, :func:`curses.ascii.ctrl` returns the
control character corresponding to its argument.
There's also a method to retrieve an entire string, :const:`getstr()`. It isn't
used very often, because its functionality is quite limited; the only editing
keys available are the backspace key and the Enter key, which terminates the
string. It can optionally be limited to a fixed number of characters. ::
curses.echo() # Enable echoing of characters
# Get a 15-character string, with the cursor on the top line
s = stdscr.getstr(0,0, 15)
The Python :mod:`curses.textpad` module supplies something better. With it, you
can turn a window into a text box that supports an Emacs-like set of
keybindings. Various methods of :class:`Textbox` class support editing with
input validation and gathering the edit results either with or without trailing
spaces. See the library documentation on :mod:`curses.textpad` for the
details.
For More Information
====================
This HOWTO didn't cover some advanced topics, such as screen-scraping or
capturing mouse events from an xterm instance. But the Python library page for
the curses modules is now pretty complete. You should browse it next.
If you're in doubt about the detailed behavior of any of the ncurses entry
points, consult the manual pages for your curses implementation, whether it's
ncurses or a proprietary Unix vendor's. The manual pages will document any
quirks, and provide complete lists of all the functions, attributes, and
:const:`ACS_\*` characters available to you.
Because the curses API is so large, some functions aren't supported in the
Python interface, not because they're difficult to implement, but because no one
has needed them yet. Feel free to add them and then submit a patch. Also, we
don't yet have support for the menu library associated with
ncurses; feel free to add that.
If you write an interesting little program, feel free to contribute it as
another demo. We can always use more of them!
The ncurses FAQ: http://invisible-island.net/ncurses/ncurses.faq.html

439
Doc/howto/descriptor.rst Normal file
View File

@@ -0,0 +1,439 @@
======================
Descriptor HowTo Guide
======================
:Author: Raymond Hettinger
:Contact: <python at rcn dot com>
.. Contents::
Abstract
--------
Defines descriptors, summarizes the protocol, and shows how descriptors are
called. Examines a custom descriptor and several built-in python descriptors
including functions, properties, static methods, and class methods. Shows how
each works by giving a pure Python equivalent and a sample application.
Learning about descriptors not only provides access to a larger toolset, it
creates a deeper understanding of how Python works and an appreciation for the
elegance of its design.
Definition and Introduction
---------------------------
In general, a descriptor is an object attribute with "binding behavior", one
whose attribute access has been overridden by methods in the descriptor
protocol. Those methods are :meth:`__get__`, :meth:`__set__`, and
:meth:`__delete__`. If any of those methods are defined for an object, it is
said to be a descriptor.
The default behavior for attribute access is to get, set, or delete the
attribute from an object's dictionary. For instance, ``a.x`` has a lookup chain
starting with ``a.__dict__['x']``, then ``type(a).__dict__['x']``, and
continuing through the base classes of ``type(a)`` excluding metaclasses. If the
looked-up value is an object defining one of the descriptor methods, then Python
may override the default behavior and invoke the descriptor method instead.
Where this occurs in the precedence chain depends on which descriptor methods
were defined. Note that descriptors are only invoked for new style objects or
classes (a class is new style if it inherits from :class:`object` or
:class:`type`).
Descriptors are a powerful, general purpose protocol. They are the mechanism
behind properties, methods, static methods, class methods, and :func:`super()`.
They are used throughout Python itself to implement the new style classes
introduced in version 2.2. Descriptors simplify the underlying C-code and offer
a flexible set of new tools for everyday Python programs.
Descriptor Protocol
-------------------
``descr.__get__(self, obj, type=None) --> value``
``descr.__set__(self, obj, value) --> None``
``descr.__delete__(self, obj) --> None``
That is all there is to it. Define any of these methods and an object is
considered a descriptor and can override default behavior upon being looked up
as an attribute.
If an object defines both :meth:`__get__` and :meth:`__set__`, it is considered
a data descriptor. Descriptors that only define :meth:`__get__` are called
non-data descriptors (they are typically used for methods but other uses are
possible).
Data and non-data descriptors differ in how overrides are calculated with
respect to entries in an instance's dictionary. If an instance's dictionary
has an entry with the same name as a data descriptor, the data descriptor
takes precedence. If an instance's dictionary has an entry with the same
name as a non-data descriptor, the dictionary entry takes precedence.
To make a read-only data descriptor, define both :meth:`__get__` and
:meth:`__set__` with the :meth:`__set__` raising an :exc:`AttributeError` when
called. Defining the :meth:`__set__` method with an exception raising
placeholder is enough to make it a data descriptor.
Invoking Descriptors
--------------------
A descriptor can be called directly by its method name. For example,
``d.__get__(obj)``.
Alternatively, it is more common for a descriptor to be invoked automatically
upon attribute access. For example, ``obj.d`` looks up ``d`` in the dictionary
of ``obj``. If ``d`` defines the method :meth:`__get__`, then ``d.__get__(obj)``
is invoked according to the precedence rules listed below.
The details of invocation depend on whether ``obj`` is an object or a class.
Either way, descriptors only work for new style objects and classes. A class is
new style if it is a subclass of :class:`object`.
For objects, the machinery is in :meth:`object.__getattribute__` which
transforms ``b.x`` into ``type(b).__dict__['x'].__get__(b, type(b))``. The
implementation works through a precedence chain that gives data descriptors
priority over instance variables, instance variables priority over non-data
descriptors, and assigns lowest priority to :meth:`__getattr__` if provided.
The full C implementation can be found in :c:func:`PyObject_GenericGetAttr()` in
:source:`Objects/object.c`.
For classes, the machinery is in :meth:`type.__getattribute__` which transforms
``B.x`` into ``B.__dict__['x'].__get__(None, B)``. In pure Python, it looks
like::
def __getattribute__(self, key):
"Emulate type_getattro() in Objects/typeobject.c"
v = object.__getattribute__(self, key)
if hasattr(v, '__get__'):
return v.__get__(None, self)
return v
The important points to remember are:
* descriptors are invoked by the :meth:`__getattribute__` method
* overriding :meth:`__getattribute__` prevents automatic descriptor calls
* :meth:`__getattribute__` is only available with new style classes and objects
* :meth:`object.__getattribute__` and :meth:`type.__getattribute__` make
different calls to :meth:`__get__`.
* data descriptors always override instance dictionaries.
* non-data descriptors may be overridden by instance dictionaries.
The object returned by ``super()`` also has a custom :meth:`__getattribute__`
method for invoking descriptors. The call ``super(B, obj).m()`` searches
``obj.__class__.__mro__`` for the base class ``A`` immediately following ``B``
and then returns ``A.__dict__['m'].__get__(obj, B)``. If not a descriptor,
``m`` is returned unchanged. If not in the dictionary, ``m`` reverts to a
search using :meth:`object.__getattribute__`.
Note, in Python 2.2, ``super(B, obj).m()`` would only invoke :meth:`__get__` if
``m`` was a data descriptor. In Python 2.3, non-data descriptors also get
invoked unless an old-style class is involved. The implementation details are
in :c:func:`super_getattro()` in :source:`Objects/typeobject.c`.
.. _`Guido's Tutorial`: https://www.python.org/download/releases/2.2.3/descrintro/#cooperation
The details above show that the mechanism for descriptors is embedded in the
:meth:`__getattribute__()` methods for :class:`object`, :class:`type`, and
:func:`super`. Classes inherit this machinery when they derive from
:class:`object` or if they have a meta-class providing similar functionality.
Likewise, classes can turn-off descriptor invocation by overriding
:meth:`__getattribute__()`.
Descriptor Example
------------------
The following code creates a class whose objects are data descriptors which
print a message for each get or set. Overriding :meth:`__getattribute__` is
alternate approach that could do this for every attribute. However, this
descriptor is useful for monitoring just a few chosen attributes::
class RevealAccess(object):
"""A data descriptor that sets and returns values
normally and prints a message logging their access.
"""
def __init__(self, initval=None, name='var'):
self.val = initval
self.name = name
def __get__(self, obj, objtype):
print 'Retrieving', self.name
return self.val
def __set__(self, obj, val):
print 'Updating', self.name
self.val = val
>>> class MyClass(object):
... x = RevealAccess(10, 'var "x"')
... y = 5
...
>>> m = MyClass()
>>> m.x
Retrieving var "x"
10
>>> m.x = 20
Updating var "x"
>>> m.x
Retrieving var "x"
20
>>> m.y
5
The protocol is simple and offers exciting possibilities. Several use cases are
so common that they have been packaged into individual function calls.
Properties, bound and unbound methods, static methods, and class methods are all
based on the descriptor protocol.
Properties
----------
Calling :func:`property` is a succinct way of building a data descriptor that
triggers function calls upon access to an attribute. Its signature is::
property(fget=None, fset=None, fdel=None, doc=None) -> property attribute
The documentation shows a typical use to define a managed attribute ``x``::
class C(object):
def getx(self): return self.__x
def setx(self, value): self.__x = value
def delx(self): del self.__x
x = property(getx, setx, delx, "I'm the 'x' property.")
To see how :func:`property` is implemented in terms of the descriptor protocol,
here is a pure Python equivalent::
class Property(object):
"Emulate PyProperty_Type() in Objects/descrobject.c"
def __init__(self, fget=None, fset=None, fdel=None, doc=None):
self.fget = fget
self.fset = fset
self.fdel = fdel
if doc is None and fget is not None:
doc = fget.__doc__
self.__doc__ = doc
def __get__(self, obj, objtype=None):
if obj is None:
return self
if self.fget is None:
raise AttributeError("unreadable attribute")
return self.fget(obj)
def __set__(self, obj, value):
if self.fset is None:
raise AttributeError("can't set attribute")
self.fset(obj, value)
def __delete__(self, obj):
if self.fdel is None:
raise AttributeError("can't delete attribute")
self.fdel(obj)
def getter(self, fget):
return type(self)(fget, self.fset, self.fdel, self.__doc__)
def setter(self, fset):
return type(self)(self.fget, fset, self.fdel, self.__doc__)
def deleter(self, fdel):
return type(self)(self.fget, self.fset, fdel, self.__doc__)
The :func:`property` builtin helps whenever a user interface has granted
attribute access and then subsequent changes require the intervention of a
method.
For instance, a spreadsheet class may grant access to a cell value through
``Cell('b10').value``. Subsequent improvements to the program require the cell
to be recalculated on every access; however, the programmer does not want to
affect existing client code accessing the attribute directly. The solution is
to wrap access to the value attribute in a property data descriptor::
class Cell(object):
. . .
def getvalue(self):
"Recalculate the cell before returning value"
self.recalc()
return self._value
value = property(getvalue)
Functions and Methods
---------------------
Python's object oriented features are built upon a function based environment.
Using non-data descriptors, the two are merged seamlessly.
Class dictionaries store methods as functions. In a class definition, methods
are written using :keyword:`def` and :keyword:`lambda`, the usual tools for
creating functions. The only difference from regular functions is that the
first argument is reserved for the object instance. By Python convention, the
instance reference is called *self* but may be called *this* or any other
variable name.
To support method calls, functions include the :meth:`__get__` method for
binding methods during attribute access. This means that all functions are
non-data descriptors which return bound or unbound methods depending whether
they are invoked from an object or a class. In pure python, it works like
this::
class Function(object):
. . .
def __get__(self, obj, objtype=None):
"Simulate func_descr_get() in Objects/funcobject.c"
return types.MethodType(self, obj, objtype)
Running the interpreter shows how the function descriptor works in practice::
>>> class D(object):
... def f(self, x):
... return x
...
>>> d = D()
>>> D.__dict__['f'] # Stored internally as a function
<function f at 0x00C45070>
>>> D.f # Get from a class becomes an unbound method
<unbound method D.f>
>>> d.f # Get from an instance becomes a bound method
<bound method D.f of <__main__.D object at 0x00B18C90>>
The output suggests that bound and unbound methods are two different types.
While they could have been implemented that way, the actual C implementation of
:c:type:`PyMethod_Type` in :source:`Objects/classobject.c` is a single object
with two different representations depending on whether the :attr:`im_self`
field is set or is *NULL* (the C equivalent of ``None``).
Likewise, the effects of calling a method object depend on the :attr:`im_self`
field. If set (meaning bound), the original function (stored in the
:attr:`im_func` field) is called as expected with the first argument set to the
instance. If unbound, all of the arguments are passed unchanged to the original
function. The actual C implementation of :func:`instancemethod_call()` is only
slightly more complex in that it includes some type checking.
Static Methods and Class Methods
--------------------------------
Non-data descriptors provide a simple mechanism for variations on the usual
patterns of binding functions into methods.
To recap, functions have a :meth:`__get__` method so that they can be converted
to a method when accessed as attributes. The non-data descriptor transforms an
``obj.f(*args)`` call into ``f(obj, *args)``. Calling ``klass.f(*args)``
becomes ``f(*args)``.
This chart summarizes the binding and its two most useful variants:
+-----------------+----------------------+------------------+
| Transformation | Called from an | Called from a |
| | Object | Class |
+=================+======================+==================+
| function | f(obj, \*args) | f(\*args) |
+-----------------+----------------------+------------------+
| staticmethod | f(\*args) | f(\*args) |
+-----------------+----------------------+------------------+
| classmethod | f(type(obj), \*args) | f(klass, \*args) |
+-----------------+----------------------+------------------+
Static methods return the underlying function without changes. Calling either
``c.f`` or ``C.f`` is the equivalent of a direct lookup into
``object.__getattribute__(c, "f")`` or ``object.__getattribute__(C, "f")``. As a
result, the function becomes identically accessible from either an object or a
class.
Good candidates for static methods are methods that do not reference the
``self`` variable.
For instance, a statistics package may include a container class for
experimental data. The class provides normal methods for computing the average,
mean, median, and other descriptive statistics that depend on the data. However,
there may be useful functions which are conceptually related but do not depend
on the data. For instance, ``erf(x)`` is handy conversion routine that comes up
in statistical work but does not directly depend on a particular dataset.
It can be called either from an object or the class: ``s.erf(1.5) --> .9332`` or
``Sample.erf(1.5) --> .9332``.
Since staticmethods return the underlying function with no changes, the example
calls are unexciting::
>>> class E(object):
... def f(x):
... print x
... f = staticmethod(f)
...
>>> print E.f(3)
3
>>> print E().f(3)
3
Using the non-data descriptor protocol, a pure Python version of
:func:`staticmethod` would look like this::
class StaticMethod(object):
"Emulate PyStaticMethod_Type() in Objects/funcobject.c"
def __init__(self, f):
self.f = f
def __get__(self, obj, objtype=None):
return self.f
Unlike static methods, class methods prepend the class reference to the
argument list before calling the function. This format is the same
for whether the caller is an object or a class::
>>> class E(object):
... def f(klass, x):
... return klass.__name__, x
... f = classmethod(f)
...
>>> print E.f(3)
('E', 3)
>>> print E().f(3)
('E', 3)
This behavior is useful whenever the function only needs to have a class
reference and does not care about any underlying data. One use for classmethods
is to create alternate class constructors. In Python 2.3, the classmethod
:func:`dict.fromkeys` creates a new dictionary from a list of keys. The pure
Python equivalent is::
class Dict(object):
. . .
def fromkeys(klass, iterable, value=None):
"Emulate dict_fromkeys() in Objects/dictobject.c"
d = klass()
for key in iterable:
d[key] = value
return d
fromkeys = classmethod(fromkeys)
Now a new dictionary of unique keys can be constructed like this::
>>> Dict.fromkeys('abracadabra')
{'a': None, 'r': None, 'b': None, 'c': None, 'd': None}
Using the non-data descriptor protocol, a pure Python version of
:func:`classmethod` would look like this::
class ClassMethod(object):
"Emulate PyClassMethod_Type() in Objects/funcobject.c"
def __init__(self, f):
self.f = f
def __get__(self, obj, klass=None):
if klass is None:
klass = type(obj)
def newfunc(*args):
return self.f(klass, *args)
return newfunc

327
Doc/howto/doanddont.rst Normal file
View File

@@ -0,0 +1,327 @@
************************************
Idioms and Anti-Idioms in Python
************************************
:Author: Moshe Zadka
This document is placed in the public domain.
.. topic:: Abstract
This document can be considered a companion to the tutorial. It shows how to use
Python, and even more importantly, how *not* to use Python.
Language Constructs You Should Not Use
======================================
While Python has relatively few gotchas compared to other languages, it still
has some constructs which are only useful in corner cases, or are plain
dangerous.
from module import \*
---------------------
Inside Function Definitions
^^^^^^^^^^^^^^^^^^^^^^^^^^^
``from module import *`` is *invalid* inside function definitions. While many
versions of Python do not check for the invalidity, it does not make it more
valid, no more than having a smart lawyer makes a man innocent. Do not use it
like that ever. Even in versions where it was accepted, it made the function
execution slower, because the compiler could not be certain which names were
local and which were global. In Python 2.1 this construct causes warnings, and
sometimes even errors.
At Module Level
^^^^^^^^^^^^^^^
While it is valid to use ``from module import *`` at module level it is usually
a bad idea. For one, this loses an important property Python otherwise has ---
you can know where each toplevel name is defined by a simple "search" function
in your favourite editor. You also open yourself to trouble in the future, if
some module grows additional functions or classes.
One of the most awful questions asked on the newsgroup is why this code::
f = open("www")
f.read()
does not work. Of course, it works just fine (assuming you have a file called
"www".) But it does not work if somewhere in the module, the statement ``from
os import *`` is present. The :mod:`os` module has a function called
:func:`open` which returns an integer. While it is very useful, shadowing a
builtin is one of its least useful properties.
Remember, you can never know for sure what names a module exports, so either
take what you need --- ``from module import name1, name2``, or keep them in the
module and access on a per-need basis --- ``import module;print module.name``.
When It Is Just Fine
^^^^^^^^^^^^^^^^^^^^
There are situations in which ``from module import *`` is just fine:
* The interactive prompt. For example, ``from math import *`` makes Python an
amazing scientific calculator.
* When extending a module in C with a module in Python.
* When the module advertises itself as ``from import *`` safe.
Unadorned :keyword:`exec`, :func:`execfile` and friends
-------------------------------------------------------
The word "unadorned" refers to the use without an explicit dictionary, in which
case those constructs evaluate code in the *current* environment. This is
dangerous for the same reasons ``from import *`` is dangerous --- it might step
over variables you are counting on and mess up things for the rest of your code.
Simply do not do that.
Bad examples::
>>> for name in sys.argv[1:]:
>>> exec "%s=1" % name
>>> def func(s, **kw):
>>> for var, val in kw.items():
>>> exec "s.%s=val" % var # invalid!
>>> execfile("handler.py")
>>> handle()
Good examples::
>>> d = {}
>>> for name in sys.argv[1:]:
>>> d[name] = 1
>>> def func(s, **kw):
>>> for var, val in kw.items():
>>> setattr(s, var, val)
>>> d={}
>>> execfile("handle.py", d, d)
>>> handle = d['handle']
>>> handle()
from module import name1, name2
-------------------------------
This is a "don't" which is much weaker than the previous "don't"s but is still
something you should not do if you don't have good reasons to do that. The
reason it is usually a bad idea is because you suddenly have an object which lives
in two separate namespaces. When the binding in one namespace changes, the
binding in the other will not, so there will be a discrepancy between them. This
happens when, for example, one module is reloaded, or changes the definition of
a function at runtime.
Bad example::
# foo.py
a = 1
# bar.py
from foo import a
if something():
a = 2 # danger: foo.a != a
Good example::
# foo.py
a = 1
# bar.py
import foo
if something():
foo.a = 2
except:
-------
Python has the ``except:`` clause, which catches all exceptions. Since *every*
error in Python raises an exception, using ``except:`` can make many
programming errors look like runtime problems, which hinders the debugging
process.
The following code shows a great example of why this is bad::
try:
foo = opne("file") # misspelled "open"
except:
sys.exit("could not open file!")
The second line triggers a :exc:`NameError`, which is caught by the except
clause. The program will exit, and the error message the program prints will
make you think the problem is the readability of ``"file"`` when in fact
the real error has nothing to do with ``"file"``.
A better way to write the above is ::
try:
foo = opne("file")
except IOError:
sys.exit("could not open file")
When this is run, Python will produce a traceback showing the :exc:`NameError`,
and it will be immediately apparent what needs to be fixed.
.. index:: bare except, except; bare
Because ``except:`` catches *all* exceptions, including :exc:`SystemExit`,
:exc:`KeyboardInterrupt`, and :exc:`GeneratorExit` (which is not an error and
should not normally be caught by user code), using a bare ``except:`` is almost
never a good idea. In situations where you need to catch all "normal" errors,
such as in a framework that runs callbacks, you can catch the base class for
all normal exceptions, :exc:`Exception`. Unfortunately in Python 2.x it is
possible for third-party code to raise exceptions that do not inherit from
:exc:`Exception`, so in Python 2.x there are some cases where you may have to
use a bare ``except:`` and manually re-raise the exceptions you don't want
to catch.
Exceptions
==========
Exceptions are a useful feature of Python. You should learn to raise them
whenever something unexpected occurs, and catch them only where you can do
something about them.
The following is a very popular anti-idiom ::
def get_status(file):
if not os.path.exists(file):
print "file not found"
sys.exit(1)
return open(file).readline()
Consider the case where the file gets deleted between the time the call to
:func:`os.path.exists` is made and the time :func:`open` is called. In that
case the last line will raise an :exc:`IOError`. The same thing would happen
if *file* exists but has no read permission. Since testing this on a normal
machine on existent and non-existent files makes it seem bugless, the test
results will seem fine, and the code will get shipped. Later an unhandled
:exc:`IOError` (or perhaps some other :exc:`EnvironmentError`) escapes to the
user, who gets to watch the ugly traceback.
Here is a somewhat better way to do it. ::
def get_status(file):
try:
return open(file).readline()
except EnvironmentError as err:
print "Unable to open file: {}".format(err)
sys.exit(1)
In this version, *either* the file gets opened and the line is read (so it
works even on flaky NFS or SMB connections), or an error message is printed
that provides all the available information on why the open failed, and the
application is aborted.
However, even this version of :func:`get_status` makes too many assumptions ---
that it will only be used in a short running script, and not, say, in a long
running server. Sure, the caller could do something like ::
try:
status = get_status(log)
except SystemExit:
status = None
But there is a better way. You should try to use as few ``except`` clauses in
your code as you can --- the ones you do use will usually be inside calls which
should always succeed, or a catch-all in a main function.
So, an even better version of :func:`get_status()` is probably ::
def get_status(file):
return open(file).readline()
The caller can deal with the exception if it wants (for example, if it tries
several files in a loop), or just let the exception filter upwards to *its*
caller.
But the last version still has a serious problem --- due to implementation
details in CPython, the file would not be closed when an exception is raised
until the exception handler finishes; and, worse, in other implementations
(e.g., Jython) it might not be closed at all regardless of whether or not
an exception is raised.
The best version of this function uses the ``open()`` call as a context
manager, which will ensure that the file gets closed as soon as the
function returns::
def get_status(file):
with open(file) as fp:
return fp.readline()
Using the Batteries
===================
Every so often, people seem to be writing stuff in the Python library again,
usually poorly. While the occasional module has a poor interface, it is usually
much better to use the rich standard library and data types that come with
Python than inventing your own.
A useful module very few people know about is :mod:`os.path`. It always has the
correct path arithmetic for your operating system, and will usually be much
better than whatever you come up with yourself.
Compare::
# ugh!
return dir+"/"+file
# better
return os.path.join(dir, file)
More useful functions in :mod:`os.path`: :func:`basename`, :func:`dirname` and
:func:`splitext`.
There are also many useful built-in functions people seem not to be aware of
for some reason: :func:`min` and :func:`max` can find the minimum/maximum of
any sequence with comparable semantics, for example, yet many people write
their own :func:`max`/:func:`min`. Another highly useful function is
:func:`reduce` which can be used to repeatedly apply a binary operation to a
sequence, reducing it to a single value. For example, compute a factorial
with a series of multiply operations::
>>> n = 4
>>> import operator
>>> reduce(operator.mul, range(1, n+1))
24
When it comes to parsing numbers, note that :func:`float`, :func:`int` and
:func:`long` all accept string arguments and will reject ill-formed strings
by raising an :exc:`ValueError`.
Using Backslash to Continue Statements
======================================
Since Python treats a newline as a statement terminator, and since statements
are often more than is comfortable to put in one line, many people do::
if foo.bar()['first'][0] == baz.quux(1, 2)[5:9] and \
calculate_number(10, 20) != forbulate(500, 360):
pass
You should realize that this is dangerous: a stray space after the ``\`` would
make this line wrong, and stray spaces are notoriously hard to see in editors.
In this case, at least it would be a syntax error, but if the code was::
value = foo.bar()['first'][0]*baz.quux(1, 2)[5:9] \
+ calculate_number(10, 20)*forbulate(500, 360)
then it would just be subtly wrong.
It is usually much better to use the implicit continuation inside parenthesis:
This version is bulletproof::
value = (foo.bar()['first'][0]*baz.quux(1, 2)[5:9]
+ calculate_number(10, 20)*forbulate(500, 360))

1250
Doc/howto/functional.rst Normal file

File diff suppressed because it is too large Load Diff

31
Doc/howto/index.rst Normal file
View File

@@ -0,0 +1,31 @@
***************
Python HOWTOs
***************
Python HOWTOs are documents that cover a single, specific topic,
and attempt to cover it fairly completely. Modelled on the Linux
Documentation Project's HOWTO collection, this collection is an
effort to foster documentation that's more detailed than the
Python Library Reference.
Currently, the HOWTOs are:
.. toctree::
:maxdepth: 1
pyporting.rst
cporting.rst
curses.rst
descriptor.rst
doanddont.rst
functional.rst
logging.rst
logging-cookbook.rst
regex.rst
sockets.rst
sorting.rst
unicode.rst
urllib2.rst
webservers.rst
argparse.rst

File diff suppressed because it is too large Load Diff

1051
Doc/howto/logging.rst Normal file

File diff suppressed because it is too large Load Diff

BIN
Doc/howto/logging_flow.png Executable file

Binary file not shown.

After

Width:  |  Height:  |  Size: 48 KiB

452
Doc/howto/pyporting.rst Normal file
View File

@@ -0,0 +1,452 @@
.. _pyporting-howto:
*********************************
Porting Python 2 Code to Python 3
*********************************
:author: Brett Cannon
.. topic:: Abstract
With Python 3 being the future of Python while Python 2 is still in active
use, it is good to have your project available for both major releases of
Python. This guide is meant to help you figure out how best to support both
Python 2 & 3 simultaneously.
If you are looking to port an extension module instead of pure Python code,
please see :ref:`cporting-howto`.
If you would like to read one core Python developer's take on why Python 3
came into existence, you can read Nick Coghlan's `Python 3 Q & A`_ or
Brett Cannon's `Why Python 3 exists`_.
For help with porting, you can email the python-porting_ mailing list with
questions.
The Short Explanation
=====================
To make your project be single-source Python 2/3 compatible, the basic steps
are:
#. Only worry about supporting Python 2.7
#. Make sure you have good test coverage (coverage.py_ can help;
``pip install coverage``)
#. Learn the differences between Python 2 & 3
#. Use Futurize_ (or Modernize_) to update your code (e.g. ``pip install future``)
#. Use Pylint_ to help make sure you don't regress on your Python 3 support
(``pip install pylint``)
#. Use caniusepython3_ to find out which of your dependencies are blocking your
use of Python 3 (``pip install caniusepython3``)
#. Once your dependencies are no longer blocking you, use continuous integration
to make sure you stay compatible with Python 2 & 3 (tox_ can help test
against multiple versions of Python; ``pip install tox``)
#. Consider using optional static type checking to make sure your type usage
works in both Python 2 & 3 (e.g. use mypy_ to check your typing under both
Python 2 & Python 3).
Details
=======
A key point about supporting Python 2 & 3 simultaneously is that you can start
**today**! Even if your dependencies are not supporting Python 3 yet that does
not mean you can't modernize your code **now** to support Python 3. Most changes
required to support Python 3 lead to cleaner code using newer practices even in
Python 2 code.
Another key point is that modernizing your Python 2 code to also support
Python 3 is largely automated for you. While you might have to make some API
decisions thanks to Python 3 clarifying text data versus binary data, the
lower-level work is now mostly done for you and thus can at least benefit from
the automated changes immediately.
Keep those key points in mind while you read on about the details of porting
your code to support Python 2 & 3 simultaneously.
Drop support for Python 2.6 and older
-------------------------------------
While you can make Python 2.5 work with Python 3, it is **much** easier if you
only have to work with Python 2.7. If dropping Python 2.5 is not an
option then the six_ project can help you support Python 2.5 & 3 simultaneously
(``pip install six``). Do realize, though, that nearly all the projects listed
in this HOWTO will not be available to you.
If you are able to skip Python 2.5 and older, then the required changes
to your code should continue to look and feel like idiomatic Python code. At
worst you will have to use a function instead of a method in some instances or
have to import a function instead of using a built-in one, but otherwise the
overall transformation should not feel foreign to you.
But you should aim for only supporting Python 2.7. Python 2.6 is no longer
freely supported and thus is not receiving bugfixes. This means **you** will have
to work around any issues you come across with Python 2.6. There are also some
tools mentioned in this HOWTO which do not support Python 2.6 (e.g., Pylint_),
and this will become more commonplace as time goes on. It will simply be easier
for you if you only support the versions of Python that you have to support.
Make sure you specify the proper version support in your ``setup.py`` file
--------------------------------------------------------------------------
In your ``setup.py`` file you should have the proper `trove classifier`_
specifying what versions of Python you support. As your project does not support
Python 3 yet you should at least have
``Programming Language :: Python :: 2 :: Only`` specified. Ideally you should
also specify each major/minor version of Python that you do support, e.g.
``Programming Language :: Python :: 2.7``.
Have good test coverage
-----------------------
Once you have your code supporting the oldest version of Python 2 you want it
to, you will want to make sure your test suite has good coverage. A good rule of
thumb is that if you want to be confident enough in your test suite that any
failures that appear after having tools rewrite your code are actual bugs in the
tools and not in your code. If you want a number to aim for, try to get over 80%
coverage (and don't feel bad if you find it hard to get better than 90%
coverage). If you don't already have a tool to measure test coverage then
coverage.py_ is recommended.
Learn the differences between Python 2 & 3
-------------------------------------------
Once you have your code well-tested you are ready to begin porting your code to
Python 3! But to fully understand how your code is going to change and what
you want to look out for while you code, you will want to learn what changes
Python 3 makes in terms of Python 2. Typically the two best ways of doing that
is reading the `"What's New"`_ doc for each release of Python 3 and the
`Porting to Python 3`_ book (which is free online). There is also a handy
`cheat sheet`_ from the Python-Future project.
Update your code
----------------
Once you feel like you know what is different in Python 3 compared to Python 2,
it's time to update your code! You have a choice between two tools in porting
your code automatically: Futurize_ and Modernize_. Which tool you choose will
depend on how much like Python 3 you want your code to be. Futurize_ does its
best to make Python 3 idioms and practices exist in Python 2, e.g. backporting
the ``bytes`` type from Python 3 so that you have semantic parity between the
major versions of Python. Modernize_,
on the other hand, is more conservative and targets a Python 2/3 subset of
Python, directly relying on six_ to help provide compatibility. As Python 3 is
the future, it might be best to consider Futurize to begin adjusting to any new
practices that Python 3 introduces which you are not accustomed to yet.
Regardless of which tool you choose, they will update your code to run under
Python 3 while staying compatible with the version of Python 2 you started with.
Depending on how conservative you want to be, you may want to run the tool over
your test suite first and visually inspect the diff to make sure the
transformation is accurate. After you have transformed your test suite and
verified that all the tests still pass as expected, then you can transform your
application code knowing that any tests which fail is a translation failure.
Unfortunately the tools can't automate everything to make your code work under
Python 3 and so there are a handful of things you will need to update manually
to get full Python 3 support (which of these steps are necessary vary between
the tools). Read the documentation for the tool you choose to use to see what it
fixes by default and what it can do optionally to know what will (not) be fixed
for you and what you may have to fix on your own (e.g. using ``io.open()`` over
the built-in ``open()`` function is off by default in Modernize). Luckily,
though, there are only a couple of things to watch out for which can be
considered large issues that may be hard to debug if not watched for.
Division
++++++++
In Python 3, ``5 / 2 == 2.5`` and not ``2``; all division between ``int`` values
result in a ``float``. This change has actually been planned since Python 2.2
which was released in 2002. Since then users have been encouraged to add
``from __future__ import division`` to any and all files which use the ``/`` and
``//`` operators or to be running the interpreter with the ``-Q`` flag. If you
have not been doing this then you will need to go through your code and do two
things:
#. Add ``from __future__ import division`` to your files
#. Update any division operator as necessary to either use ``//`` to use floor
division or continue using ``/`` and expect a float
The reason that ``/`` isn't simply translated to ``//`` automatically is that if
an object defines a ``__truediv__`` method but not ``__floordiv__`` then your
code would begin to fail (e.g. a user-defined class that uses ``/`` to
signify some operation but not ``//`` for the same thing or at all).
Text versus binary data
+++++++++++++++++++++++
In Python 2 you could use the ``str`` type for both text and binary data.
Unfortunately this confluence of two different concepts could lead to brittle
code which sometimes worked for either kind of data, sometimes not. It also
could lead to confusing APIs if people didn't explicitly state that something
that accepted ``str`` accepted either text or binary data instead of one
specific type. This complicated the situation especially for anyone supporting
multiple languages as APIs wouldn't bother explicitly supporting ``unicode``
when they claimed text data support.
To make the distinction between text and binary data clearer and more
pronounced, Python 3 did what most languages created in the age of the internet
have done and made text and binary data distinct types that cannot blindly be
mixed together (Python predates widespread access to the internet). For any code
that deals only with text or only binary data, this separation doesn't pose an
issue. But for code that has to deal with both, it does mean you might have to
now care about when you are using text compared to binary data, which is why
this cannot be entirely automated.
To start, you will need to decide which APIs take text and which take binary
(it is **highly** recommended you don't design APIs that can take both due to
the difficulty of keeping the code working; as stated earlier it is difficult to
do well). In Python 2 this means making sure the APIs that take text can work
with ``unicode`` and those that work with binary data work with the
``bytes`` type from Python 3 (which is a subset of ``str`` in Python 2 and acts
as an alias for ``bytes`` type in Python 2). Usually the biggest issue is
realizing which methods exist on which types in Python 2 & 3 simultaneously
(for text that's ``unicode`` in Python 2 and ``str`` in Python 3, for binary
that's ``str``/``bytes`` in Python 2 and ``bytes`` in Python 3). The following
table lists the **unique** methods of each data type across Python 2 & 3
(e.g., the ``decode()`` method is usable on the equivalent binary data type in
either Python 2 or 3, but it can't be used by the textual data type consistently
between Python 2 and 3 because ``str`` in Python 3 doesn't have the method). Do
note that as of Python 3.5 the ``__mod__`` method was added to the bytes type.
======================== =====================
**Text data** **Binary data**
------------------------ ---------------------
\ decode
------------------------ ---------------------
encode
------------------------ ---------------------
format
------------------------ ---------------------
isdecimal
------------------------ ---------------------
isnumeric
======================== =====================
Making the distinction easier to handle can be accomplished by encoding and
decoding between binary data and text at the edge of your code. This means that
when you receive text in binary data, you should immediately decode it. And if
your code needs to send text as binary data then encode it as late as possible.
This allows your code to work with only text internally and thus eliminates
having to keep track of what type of data you are working with.
The next issue is making sure you know whether the string literals in your code
represent text or binary data. You should add a ``b`` prefix to any
literal that presents binary data. For text you should add a ``u`` prefix to
the text literal. (there is a :mod:`__future__` import to force all unspecified
literals to be Unicode, but usage has shown it isn't as effective as adding a
``b`` or ``u`` prefix to all literals explicitly)
As part of this dichotomy you also need to be careful about opening files.
Unless you have been working on Windows, there is a chance you have not always
bothered to add the ``b`` mode when opening a binary file (e.g., ``rb`` for
binary reading). Under Python 3, binary files and text files are clearly
distinct and mutually incompatible; see the :mod:`io` module for details.
Therefore, you **must** make a decision of whether a file will be used for
binary access (allowing binary data to be read and/or written) or textual access
(allowing text data to be read and/or written). You should also use :func:`io.open`
for opening files instead of the built-in :func:`open` function as the :mod:`io`
module is consistent from Python 2 to 3 while the built-in :func:`open` function
is not (in Python 3 it's actually :func:`io.open`). Do not bother with the
outdated practice of using :func:`codecs.open` as that's only necessary for
keeping compatibility with Python 2.5.
The constructors of both ``str`` and ``bytes`` have different semantics for the
same arguments between Python 2 & 3. Passing an integer to ``bytes`` in Python 2
will give you the string representation of the integer: ``bytes(3) == '3'``.
But in Python 3, an integer argument to ``bytes`` will give you a bytes object
as long as the integer specified, filled with null bytes:
``bytes(3) == b'\x00\x00\x00'``. A similar worry is necessary when passing a
bytes object to ``str``. In Python 2 you just get the bytes object back:
``str(b'3') == b'3'``. But in Python 3 you get the string representation of the
bytes object: ``str(b'3') == "b'3'"``.
Finally, the indexing of binary data requires careful handling (slicing does
**not** require any special handling). In Python 2,
``b'123'[1] == b'2'`` while in Python 3 ``b'123'[1] == 50``. Because binary data
is simply a collection of binary numbers, Python 3 returns the integer value for
the byte you index on. But in Python 2 because ``bytes == str``, indexing
returns a one-item slice of bytes. The six_ project has a function
named ``six.indexbytes()`` which will return an integer like in Python 3:
``six.indexbytes(b'123', 1)``.
To summarize:
#. Decide which of your APIs take text and which take binary data
#. Make sure that your code that works with text also works with ``unicode`` and
code for binary data works with ``bytes`` in Python 2 (see the table above
for what methods you cannot use for each type)
#. Mark all binary literals with a ``b`` prefix, textual literals with a ``u``
prefix
#. Decode binary data to text as soon as possible, encode text as binary data as
late as possible
#. Open files using :func:`io.open` and make sure to specify the ``b`` mode when
appropriate
#. Be careful when indexing into binary data
Use feature detection instead of version detection
++++++++++++++++++++++++++++++++++++++++++++++++++
Inevitably you will have code that has to choose what to do based on what
version of Python is running. The best way to do this is with feature detection
of whether the version of Python you're running under supports what you need.
If for some reason that doesn't work then you should make the version check be
against Python 2 and not Python 3. To help explain this, let's look at an
example.
Let's pretend that you need access to a feature of importlib_ that
is available in Python's standard library since Python 3.3 and available for
Python 2 through importlib2_ on PyPI. You might be tempted to write code to
access e.g. the ``importlib.abc`` module by doing the following::
import sys
if sys.version_info[0] == 3:
from importlib import abc
else:
from importlib2 import abc
The problem with this code is what happens when Python 4 comes out? It would
be better to treat Python 2 as the exceptional case instead of Python 3 and
assume that future Python versions will be more compatible with Python 3 than
Python 2::
import sys
if sys.version_info[0] > 2:
from importlib import abc
else:
from importlib2 import abc
The best solution, though, is to do no version detection at all and instead rely
on feature detection. That avoids any potential issues of getting the version
detection wrong and helps keep you future-compatible::
try:
from importlib import abc
except ImportError:
from importlib2 import abc
Prevent compatibility regressions
---------------------------------
Once you have fully translated your code to be compatible with Python 3, you
will want to make sure your code doesn't regress and stop working under
Python 3. This is especially true if you have a dependency which is blocking you
from actually running under Python 3 at the moment.
To help with staying compatible, any new modules you create should have
at least the following block of code at the top of it::
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
You can also run Python 2 with the ``-3`` flag to be warned about various
compatibility issues your code triggers during execution. If you turn warnings
into errors with ``-Werror`` then you can make sure that you don't accidentally
miss a warning.
You can also use the Pylint_ project and its ``--py3k`` flag to lint your code
to receive warnings when your code begins to deviate from Python 3
compatibility. This also prevents you from having to run Modernize_ or Futurize_
over your code regularly to catch compatibility regressions. This does require
you only support Python 2.7 and Python 3.4 or newer as that is Pylint's
minimum Python version support.
Check which dependencies block your transition
----------------------------------------------
**After** you have made your code compatible with Python 3 you should begin to
care about whether your dependencies have also been ported. The caniusepython3_
project was created to help you determine which projects
-- directly or indirectly -- are blocking you from supporting Python 3. There
is both a command-line tool as well as a web interface at
https://caniusepython3.com.
The project also provides code which you can integrate into your test suite so
that you will have a failing test when you no longer have dependencies blocking
you from using Python 3. This allows you to avoid having to manually check your
dependencies and to be notified quickly when you can start running on Python 3.
Update your ``setup.py`` file to denote Python 3 compatibility
--------------------------------------------------------------
Once your code works under Python 3, you should update the classifiers in
your ``setup.py`` to contain ``Programming Language :: Python :: 3`` and to not
specify sole Python 2 support. This will tell anyone using your code that you
support Python 2 **and** 3. Ideally you will also want to add classifiers for
each major/minor version of Python you now support.
Use continuous integration to stay compatible
---------------------------------------------
Once you are able to fully run under Python 3 you will want to make sure your
code always works under both Python 2 & 3. Probably the best tool for running
your tests under multiple Python interpreters is tox_. You can then integrate
tox with your continuous integration system so that you never accidentally break
Python 2 or 3 support.
You may also want to use the ``-bb`` flag with the Python 3 interpreter to
trigger an exception when you are comparing bytes to strings or bytes to an int
(the latter is available starting in Python 3.5). By default type-differing
comparisons simply return ``False``, but if you made a mistake in your
separation of text/binary data handling or indexing on bytes you wouldn't easily
find the mistake. This flag will raise an exception when these kinds of
comparisons occur, making the mistake much easier to track down.
And that's mostly it! At this point your code base is compatible with both
Python 2 and 3 simultaneously. Your testing will also be set up so that you
don't accidentally break Python 2 or 3 compatibility regardless of which version
you typically run your tests under while developing.
Consider using optional static type checking
--------------------------------------------
Another way to help port your code is to use a static type checker like
mypy_ or pytype_ on your code. These tools can be used to analyze your code as
if it's being run under Python 2, then you can run the tool a second time as if
your code is running under Python 3. By running a static type checker twice like
this you can discover if you're e.g. misusing binary data type in one version
of Python compared to another. If you add optional type hints to your code you
can also explicitly state whether your APIs use textual or binary data, helping
to make sure everything functions as expected in both versions of Python.
.. _2to3: https://docs.python.org/3/library/2to3.html
.. _caniusepython3: https://pypi.org/project/caniusepython3
.. _cheat sheet: http://python-future.org/compatible_idioms.html
.. _coverage.py: https://pypi.org/project/coverage
.. _Futurize: http://python-future.org/automatic_conversion.html
.. _importlib: https://docs.python.org/3/library/importlib.html#module-importlib
.. _importlib2: https://pypi.org/project/importlib2
.. _Modernize: https://python-modernize.readthedocs.org/en/latest/
.. _mypy: http://mypy-lang.org/
.. _Porting to Python 3: http://python3porting.com/
.. _Pylint: https://pypi.org/project/pylint
.. _Python 3 Q & A: https://ncoghlan-devs-python-notes.readthedocs.org/en/latest/python3/questions_and_answers.html
.. _pytype: https://github.com/google/pytype
.. _python-future: http://python-future.org/
.. _python-porting: https://mail.python.org/mailman/listinfo/python-porting
.. _six: https://pypi.org/project/six
.. _tox: https://pypi.org/project/tox
.. _trove classifier: https://pypi.org/classifiers
.. _"What's New": https://docs.python.org/3/whatsnew/index.html
.. _Why Python 3 exists: http://www.snarky.ca/why-python-3-exists

1379
Doc/howto/regex.rst Normal file

File diff suppressed because it is too large Load Diff

418
Doc/howto/sockets.rst Normal file
View File

@@ -0,0 +1,418 @@
.. _socket-howto:
****************************
Socket Programming HOWTO
****************************
:Author: Gordon McMillan
.. topic:: Abstract
Sockets are used nearly everywhere, but are one of the most severely
misunderstood technologies around. This is a 10,000 foot overview of sockets.
It's not really a tutorial - you'll still have work to do in getting things
operational. It doesn't cover the fine points (and there are a lot of them), but
I hope it will give you enough background to begin using them decently.
Sockets
=======
I'm only going to talk about INET sockets, but they account for at least 99% of
the sockets in use. And I'll only talk about STREAM sockets - unless you really
know what you're doing (in which case this HOWTO isn't for you!), you'll get
better behavior and performance from a STREAM socket than anything else. I will
try to clear up the mystery of what a socket is, as well as some hints on how to
work with blocking and non-blocking sockets. But I'll start by talking about
blocking sockets. You'll need to know how they work before dealing with
non-blocking sockets.
Part of the trouble with understanding these things is that "socket" can mean a
number of subtly different things, depending on context. So first, let's make a
distinction between a "client" socket - an endpoint of a conversation, and a
"server" socket, which is more like a switchboard operator. The client
application (your browser, for example) uses "client" sockets exclusively; the
web server it's talking to uses both "server" sockets and "client" sockets.
History
-------
Of the various forms of :abbr:`IPC (Inter Process Communication)`,
sockets are by far the most popular. On any given platform, there are
likely to be other forms of IPC that are faster, but for
cross-platform communication, sockets are about the only game in town.
They were invented in Berkeley as part of the BSD flavor of Unix. They spread
like wildfire with the Internet. With good reason --- the combination of sockets
with INET makes talking to arbitrary machines around the world unbelievably easy
(at least compared to other schemes).
Creating a Socket
=================
Roughly speaking, when you clicked on the link that brought you to this page,
your browser did something like the following::
#create an INET, STREAMing socket
s = socket.socket(
socket.AF_INET, socket.SOCK_STREAM)
#now connect to the web server on port 80
# - the normal http port
s.connect(("www.mcmillan-inc.com", 80))
When the ``connect`` completes, the socket ``s`` can be used to send
in a request for the text of the page. The same socket will read the
reply, and then be destroyed. That's right, destroyed. Client sockets
are normally only used for one exchange (or a small set of sequential
exchanges).
What happens in the web server is a bit more complex. First, the web server
creates a "server socket"::
#create an INET, STREAMing socket
serversocket = socket.socket(
socket.AF_INET, socket.SOCK_STREAM)
#bind the socket to a public host,
# and a well-known port
serversocket.bind((socket.gethostname(), 80))
#become a server socket
serversocket.listen(5)
A couple things to notice: we used ``socket.gethostname()`` so that the socket
would be visible to the outside world. If we had used ``s.bind(('localhost',
80))`` or ``s.bind(('127.0.0.1', 80))`` we would still have a "server" socket,
but one that was only visible within the same machine. ``s.bind(('', 80))``
specifies that the socket is reachable by any address the machine happens to
have.
A second thing to note: low number ports are usually reserved for "well known"
services (HTTP, SNMP etc). If you're playing around, use a nice high number (4
digits).
Finally, the argument to ``listen`` tells the socket library that we want it to
queue up as many as 5 connect requests (the normal max) before refusing outside
connections. If the rest of the code is written properly, that should be plenty.
Now that we have a "server" socket, listening on port 80, we can enter the
mainloop of the web server::
while 1:
#accept connections from outside
(clientsocket, address) = serversocket.accept()
#now do something with the clientsocket
#in this case, we'll pretend this is a threaded server
ct = client_thread(clientsocket)
ct.run()
There's actually 3 general ways in which this loop could work - dispatching a
thread to handle ``clientsocket``, create a new process to handle
``clientsocket``, or restructure this app to use non-blocking sockets, and
multiplex between our "server" socket and any active ``clientsocket``\ s using
``select``. More about that later. The important thing to understand now is
this: this is *all* a "server" socket does. It doesn't send any data. It doesn't
receive any data. It just produces "client" sockets. Each ``clientsocket`` is
created in response to some *other* "client" socket doing a ``connect()`` to the
host and port we're bound to. As soon as we've created that ``clientsocket``, we
go back to listening for more connections. The two "clients" are free to chat it
up - they are using some dynamically allocated port which will be recycled when
the conversation ends.
IPC
---
If you need fast IPC between two processes on one machine, you should look into
whatever form of shared memory the platform offers. A simple protocol based
around shared memory and locks or semaphores is by far the fastest technique.
If you do decide to use sockets, bind the "server" socket to ``'localhost'``. On
most platforms, this will take a shortcut around a couple of layers of network
code and be quite a bit faster.
Using a Socket
==============
The first thing to note, is that the web browser's "client" socket and the web
server's "client" socket are identical beasts. That is, this is a "peer to peer"
conversation. Or to put it another way, *as the designer, you will have to
decide what the rules of etiquette are for a conversation*. Normally, the
``connect``\ ing socket starts the conversation, by sending in a request, or
perhaps a signon. But that's a design decision - it's not a rule of sockets.
Now there are two sets of verbs to use for communication. You can use ``send``
and ``recv``, or you can transform your client socket into a file-like beast and
use ``read`` and ``write``. The latter is the way Java presents its sockets.
I'm not going to talk about it here, except to warn you that you need to use
``flush`` on sockets. These are buffered "files", and a common mistake is to
``write`` something, and then ``read`` for a reply. Without a ``flush`` in
there, you may wait forever for the reply, because the request may still be in
your output buffer.
Now we come to the major stumbling block of sockets - ``send`` and ``recv`` operate
on the network buffers. They do not necessarily handle all the bytes you hand
them (or expect from them), because their major focus is handling the network
buffers. In general, they return when the associated network buffers have been
filled (``send``) or emptied (``recv``). They then tell you how many bytes they
handled. It is *your* responsibility to call them again until your message has
been completely dealt with.
When a ``recv`` returns 0 bytes, it means the other side has closed (or is in
the process of closing) the connection. You will not receive any more data on
this connection. Ever. You may be able to send data successfully; I'll talk
more about this later.
A protocol like HTTP uses a socket for only one transfer. The client sends a
request, then reads a reply. That's it. The socket is discarded. This means that
a client can detect the end of the reply by receiving 0 bytes.
But if you plan to reuse your socket for further transfers, you need to realize
that *there is no* :abbr:`EOT (End of Transfer)` *on a socket.* I repeat: if a socket
``send`` or ``recv`` returns after handling 0 bytes, the connection has been
broken. If the connection has *not* been broken, you may wait on a ``recv``
forever, because the socket will *not* tell you that there's nothing more to
read (for now). Now if you think about that a bit, you'll come to realize a
fundamental truth of sockets: *messages must either be fixed length* (yuck), *or
be delimited* (shrug), *or indicate how long they are* (much better), *or end by
shutting down the connection*. The choice is entirely yours, (but some ways are
righter than others).
Assuming you don't want to end the connection, the simplest solution is a fixed
length message::
class mysocket:
'''demonstration class only
- coded for clarity, not efficiency
'''
def __init__(self, sock=None):
if sock is None:
self.sock = socket.socket(
socket.AF_INET, socket.SOCK_STREAM)
else:
self.sock = sock
def connect(self, host, port):
self.sock.connect((host, port))
def mysend(self, msg):
totalsent = 0
while totalsent < MSGLEN:
sent = self.sock.send(msg[totalsent:])
if sent == 0:
raise RuntimeError("socket connection broken")
totalsent = totalsent + sent
def myreceive(self):
chunks = []
bytes_recd = 0
while bytes_recd < MSGLEN:
chunk = self.sock.recv(min(MSGLEN - bytes_recd, 2048))
if chunk == '':
raise RuntimeError("socket connection broken")
chunks.append(chunk)
bytes_recd = bytes_recd + len(chunk)
return ''.join(chunks)
The sending code here is usable for almost any messaging scheme - in Python you
send strings, and you can use ``len()`` to determine its length (even if it has
embedded ``\0`` characters). It's mostly the receiving code that gets more
complex. (And in C, it's not much worse, except you can't use ``strlen`` if the
message has embedded ``\0``\ s.)
The easiest enhancement is to make the first character of the message an
indicator of message type, and have the type determine the length. Now you have
two ``recv``\ s - the first to get (at least) that first character so you can
look up the length, and the second in a loop to get the rest. If you decide to
go the delimited route, you'll be receiving in some arbitrary chunk size, (4096
or 8192 is frequently a good match for network buffer sizes), and scanning what
you've received for a delimiter.
One complication to be aware of: if your conversational protocol allows multiple
messages to be sent back to back (without some kind of reply), and you pass
``recv`` an arbitrary chunk size, you may end up reading the start of a
following message. You'll need to put that aside and hold onto it, until it's
needed.
Prefixing the message with its length (say, as 5 numeric characters) gets more
complex, because (believe it or not), you may not get all 5 characters in one
``recv``. In playing around, you'll get away with it; but in high network loads,
your code will very quickly break unless you use two ``recv`` loops - the first
to determine the length, the second to get the data part of the message. Nasty.
This is also when you'll discover that ``send`` does not always manage to get
rid of everything in one pass. And despite having read this, you will eventually
get bit by it!
In the interests of space, building your character, (and preserving my
competitive position), these enhancements are left as an exercise for the
reader. Lets move on to cleaning up.
Binary Data
-----------
It is perfectly possible to send binary data over a socket. The major problem is
that not all machines use the same formats for binary data. For example, a
Motorola chip will represent a 16 bit integer with the value 1 as the two hex
bytes 00 01. Intel and DEC, however, are byte-reversed - that same 1 is 01 00.
Socket libraries have calls for converting 16 and 32 bit integers - ``ntohl,
htonl, ntohs, htons`` where "n" means *network* and "h" means *host*, "s" means
*short* and "l" means *long*. Where network order is host order, these do
nothing, but where the machine is byte-reversed, these swap the bytes around
appropriately.
In these days of 32 bit machines, the ascii representation of binary data is
frequently smaller than the binary representation. That's because a surprising
amount of the time, all those longs have the value 0, or maybe 1. The string "0"
would be two bytes, while binary is four. Of course, this doesn't fit well with
fixed-length messages. Decisions, decisions.
Disconnecting
=============
Strictly speaking, you're supposed to use ``shutdown`` on a socket before you
``close`` it. The ``shutdown`` is an advisory to the socket at the other end.
Depending on the argument you pass it, it can mean "I'm not going to send
anymore, but I'll still listen", or "I'm not listening, good riddance!". Most
socket libraries, however, are so used to programmers neglecting to use this
piece of etiquette that normally a ``close`` is the same as ``shutdown();
close()``. So in most situations, an explicit ``shutdown`` is not needed.
One way to use ``shutdown`` effectively is in an HTTP-like exchange. The client
sends a request and then does a ``shutdown(1)``. This tells the server "This
client is done sending, but can still receive." The server can detect "EOF" by
a receive of 0 bytes. It can assume it has the complete request. The server
sends a reply. If the ``send`` completes successfully then, indeed, the client
was still receiving.
Python takes the automatic shutdown a step further, and says that when a socket
is garbage collected, it will automatically do a ``close`` if it's needed. But
relying on this is a very bad habit. If your socket just disappears without
doing a ``close``, the socket at the other end may hang indefinitely, thinking
you're just being slow. *Please* ``close`` your sockets when you're done.
When Sockets Die
----------------
Probably the worst thing about using blocking sockets is what happens when the
other side comes down hard (without doing a ``close``). Your socket is likely to
hang. SOCKSTREAM is a reliable protocol, and it will wait a long, long time
before giving up on a connection. If you're using threads, the entire thread is
essentially dead. There's not much you can do about it. As long as you aren't
doing something dumb, like holding a lock while doing a blocking read, the
thread isn't really consuming much in the way of resources. Do *not* try to kill
the thread - part of the reason that threads are more efficient than processes
is that they avoid the overhead associated with the automatic recycling of
resources. In other words, if you do manage to kill the thread, your whole
process is likely to be screwed up.
Non-blocking Sockets
====================
If you've understood the preceding, you already know most of what you need to
know about the mechanics of using sockets. You'll still use the same calls, in
much the same ways. It's just that, if you do it right, your app will be almost
inside-out.
In Python, you use ``socket.setblocking(0)`` to make it non-blocking. In C, it's
more complex, (for one thing, you'll need to choose between the BSD flavor
``O_NONBLOCK`` and the almost indistinguishable Posix flavor ``O_NDELAY``, which
is completely different from ``TCP_NODELAY``), but it's the exact same idea. You
do this after creating the socket, but before using it. (Actually, if you're
nuts, you can switch back and forth.)
The major mechanical difference is that ``send``, ``recv``, ``connect`` and
``accept`` can return without having done anything. You have (of course) a
number of choices. You can check return code and error codes and generally drive
yourself crazy. If you don't believe me, try it sometime. Your app will grow
large, buggy and suck CPU. So let's skip the brain-dead solutions and do it
right.
Use ``select``.
In C, coding ``select`` is fairly complex. In Python, it's a piece of cake, but
it's close enough to the C version that if you understand ``select`` in Python,
you'll have little trouble with it in C::
ready_to_read, ready_to_write, in_error = \
select.select(
potential_readers,
potential_writers,
potential_errs,
timeout)
You pass ``select`` three lists: the first contains all sockets that you might
want to try reading; the second all the sockets you might want to try writing
to, and the last (normally left empty) those that you want to check for errors.
You should note that a socket can go into more than one list. The ``select``
call is blocking, but you can give it a timeout. This is generally a sensible
thing to do - give it a nice long timeout (say a minute) unless you have good
reason to do otherwise.
In return, you will get three lists. They contain the sockets that are actually
readable, writable and in error. Each of these lists is a subset (possibly
empty) of the corresponding list you passed in.
If a socket is in the output readable list, you can be
as-close-to-certain-as-we-ever-get-in-this-business that a ``recv`` on that
socket will return *something*. Same idea for the writable list. You'll be able
to send *something*. Maybe not all you want to, but *something* is better than
nothing. (Actually, any reasonably healthy socket will return as writable - it
just means outbound network buffer space is available.)
If you have a "server" socket, put it in the potential_readers list. If it comes
out in the readable list, your ``accept`` will (almost certainly) work. If you
have created a new socket to ``connect`` to someone else, put it in the
potential_writers list. If it shows up in the writable list, you have a decent
chance that it has connected.
One very nasty problem with ``select``: if somewhere in those input lists of
sockets is one which has died a nasty death, the ``select`` will fail. You then
need to loop through every single damn socket in all those lists and do a
``select([sock],[],[],0)`` until you find the bad one. That timeout of 0 means
it won't take long, but it's ugly.
Actually, ``select`` can be handy even with blocking sockets. It's one way of
determining whether you will block - the socket returns as readable when there's
something in the buffers. However, this still doesn't help with the problem of
determining whether the other end is done, or just busy with something else.
**Portability alert**: On Unix, ``select`` works both with the sockets and
files. Don't try this on Windows. On Windows, ``select`` works with sockets
only. Also note that in C, many of the more advanced socket options are done
differently on Windows. In fact, on Windows I usually use threads (which work
very, very well) with my sockets. Face it, if you want any kind of performance,
your code will look very different on Windows than on Unix.
Performance
-----------
There's no question that the fastest sockets code uses non-blocking sockets and
select to multiplex them. You can put together something that will saturate a
LAN connection without putting any strain on the CPU. The trouble is that an app
written this way can't do much of anything else - it needs to be ready to
shuffle bytes around at all times.
Assuming that your app is actually supposed to do something more than that,
threading is the optimal solution, (and using non-blocking sockets will be
faster than using blocking sockets). Unfortunately, threading support in Unixes
varies both in API and quality. So the normal Unix solution is to fork a
subprocess to deal with each connection. The overhead for this is significant
(and don't do this on Windows - the overhead of process creation is enormous
there). It also means that unless each subprocess is completely independent,
you'll need to use another form of IPC, say a pipe, or shared memory and
semaphores, to communicate between the parent and child processes.
Finally, remember that even though blocking sockets are somewhat slower than
non-blocking, in many cases they are the "right" solution. After all, if your
app is driven by the data it receives over a socket, there's not much sense in
complicating the logic just so your app can wait on ``select`` instead of
``recv``.

314
Doc/howto/sorting.rst Normal file
View File

@@ -0,0 +1,314 @@
.. _sortinghowto:
Sorting HOW TO
**************
:Author: Andrew Dalke and Raymond Hettinger
:Release: 0.1
Python lists have a built-in :meth:`list.sort` method that modifies the list
in-place. There is also a :func:`sorted` built-in function that builds a new
sorted list from an iterable.
In this document, we explore the various techniques for sorting data using Python.
Sorting Basics
==============
A simple ascending sort is very easy: just call the :func:`sorted` function. It
returns a new sorted list::
>>> sorted([5, 2, 3, 1, 4])
[1, 2, 3, 4, 5]
You can also use the :meth:`list.sort` method of a list. It modifies the list
in-place (and returns ``None`` to avoid confusion). Usually it's less convenient
than :func:`sorted` - but if you don't need the original list, it's slightly
more efficient.
>>> a = [5, 2, 3, 1, 4]
>>> a.sort()
>>> a
[1, 2, 3, 4, 5]
Another difference is that the :meth:`list.sort` method is only defined for
lists. In contrast, the :func:`sorted` function accepts any iterable.
>>> sorted({1: 'D', 2: 'B', 3: 'B', 4: 'E', 5: 'A'})
[1, 2, 3, 4, 5]
Key Functions
=============
Starting with Python 2.4, both :meth:`list.sort` and :func:`sorted` added a
*key* parameter to specify a function to be called on each list element prior to
making comparisons.
For example, here's a case-insensitive string comparison:
>>> sorted("This is a test string from Andrew".split(), key=str.lower)
['a', 'Andrew', 'from', 'is', 'string', 'test', 'This']
The value of the *key* parameter should be a function that takes a single argument
and returns a key to use for sorting purposes. This technique is fast because
the key function is called exactly once for each input record.
A common pattern is to sort complex objects using some of the object's indices
as keys. For example:
>>> student_tuples = [
... ('john', 'A', 15),
... ('jane', 'B', 12),
... ('dave', 'B', 10),
... ]
>>> sorted(student_tuples, key=lambda student: student[2]) # sort by age
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
The same technique works for objects with named attributes. For example:
>>> class Student:
... def __init__(self, name, grade, age):
... self.name = name
... self.grade = grade
... self.age = age
... def __repr__(self):
... return repr((self.name, self.grade, self.age))
>>> student_objects = [
... Student('john', 'A', 15),
... Student('jane', 'B', 12),
... Student('dave', 'B', 10),
... ]
>>> sorted(student_objects, key=lambda student: student.age) # sort by age
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
Operator Module Functions
=========================
The key-function patterns shown above are very common, so Python provides
convenience functions to make accessor functions easier and faster. The operator
module has :func:`operator.itemgetter`, :func:`operator.attrgetter`, and
starting in Python 2.5 an :func:`operator.methodcaller` function.
Using those functions, the above examples become simpler and faster:
>>> from operator import itemgetter, attrgetter
>>> sorted(student_tuples, key=itemgetter(2))
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
>>> sorted(student_objects, key=attrgetter('age'))
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
The operator module functions allow multiple levels of sorting. For example, to
sort by *grade* then by *age*:
>>> sorted(student_tuples, key=itemgetter(1,2))
[('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)]
>>> sorted(student_objects, key=attrgetter('grade', 'age'))
[('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)]
The :func:`operator.methodcaller` function makes method calls with fixed
parameters for each object being sorted. For example, the :meth:`str.count`
method could be used to compute message priority by counting the
number of exclamation marks in a message:
>>> from operator import methodcaller
>>> messages = ['critical!!!', 'hurry!', 'standby', 'immediate!!']
>>> sorted(messages, key=methodcaller('count', '!'))
['standby', 'hurry!', 'immediate!!', 'critical!!!']
Ascending and Descending
========================
Both :meth:`list.sort` and :func:`sorted` accept a *reverse* parameter with a
boolean value. This is used to flag descending sorts. For example, to get the
student data in reverse *age* order:
>>> sorted(student_tuples, key=itemgetter(2), reverse=True)
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
>>> sorted(student_objects, key=attrgetter('age'), reverse=True)
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
Sort Stability and Complex Sorts
================================
Starting with Python 2.2, sorts are guaranteed to be `stable
<https://en.wikipedia.org/wiki/Sorting_algorithm#Stability>`_\. That means that
when multiple records have the same key, their original order is preserved.
>>> data = [('red', 1), ('blue', 1), ('red', 2), ('blue', 2)]
>>> sorted(data, key=itemgetter(0))
[('blue', 1), ('blue', 2), ('red', 1), ('red', 2)]
Notice how the two records for *blue* retain their original order so that
``('blue', 1)`` is guaranteed to precede ``('blue', 2)``.
This wonderful property lets you build complex sorts in a series of sorting
steps. For example, to sort the student data by descending *grade* and then
ascending *age*, do the *age* sort first and then sort again using *grade*:
>>> s = sorted(student_objects, key=attrgetter('age')) # sort on secondary key
>>> sorted(s, key=attrgetter('grade'), reverse=True) # now sort on primary key, descending
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
The `Timsort <https://en.wikipedia.org/wiki/Timsort>`_ algorithm used in Python
does multiple sorts efficiently because it can take advantage of any ordering
already present in a dataset.
The Old Way Using Decorate-Sort-Undecorate
==========================================
This idiom is called Decorate-Sort-Undecorate after its three steps:
* First, the initial list is decorated with new values that control the sort order.
* Second, the decorated list is sorted.
* Finally, the decorations are removed, creating a list that contains only the
initial values in the new order.
For example, to sort the student data by *grade* using the DSU approach:
>>> decorated = [(student.grade, i, student) for i, student in enumerate(student_objects)]
>>> decorated.sort()
>>> [student for grade, i, student in decorated] # undecorate
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
This idiom works because tuples are compared lexicographically; the first items
are compared; if they are the same then the second items are compared, and so
on.
It is not strictly necessary in all cases to include the index *i* in the
decorated list, but including it gives two benefits:
* The sort is stable -- if two items have the same key, their order will be
preserved in the sorted list.
* The original items do not have to be comparable because the ordering of the
decorated tuples will be determined by at most the first two items. So for
example the original list could contain complex numbers which cannot be sorted
directly.
Another name for this idiom is
`Schwartzian transform <https://en.wikipedia.org/wiki/Schwartzian_transform>`_\,
after Randal L. Schwartz, who popularized it among Perl programmers.
For large lists and lists where the comparison information is expensive to
calculate, and Python versions before 2.4, DSU is likely to be the fastest way
to sort the list. For 2.4 and later, key functions provide the same
functionality.
The Old Way Using the *cmp* Parameter
=====================================
Many constructs given in this HOWTO assume Python 2.4 or later. Before that,
there was no :func:`sorted` builtin and :meth:`list.sort` took no keyword
arguments. Instead, all of the Py2.x versions supported a *cmp* parameter to
handle user specified comparison functions.
In Python 3, the *cmp* parameter was removed entirely (as part of a larger effort to
simplify and unify the language, eliminating the conflict between rich
comparisons and the :meth:`__cmp__` magic method).
In Python 2, :meth:`~list.sort` allowed an optional function which can be called for doing the
comparisons. That function should take two arguments to be compared and then
return a negative value for less-than, return zero if they are equal, or return
a positive value for greater-than. For example, we can do:
>>> def numeric_compare(x, y):
... return x - y
>>> sorted([5, 2, 4, 1, 3], cmp=numeric_compare) # doctest: +SKIP
[1, 2, 3, 4, 5]
Or you can reverse the order of comparison with:
>>> def reverse_numeric(x, y):
... return y - x
>>> sorted([5, 2, 4, 1, 3], cmp=reverse_numeric) # doctest: +SKIP
[5, 4, 3, 2, 1]
When porting code from Python 2.x to 3.x, the situation can arise when you have
the user supplying a comparison function and you need to convert that to a key
function. The following wrapper makes that easy to do::
def cmp_to_key(mycmp):
'Convert a cmp= function into a key= function'
class K(object):
def __init__(self, obj, *args):
self.obj = obj
def __lt__(self, other):
return mycmp(self.obj, other.obj) < 0
def __gt__(self, other):
return mycmp(self.obj, other.obj) > 0
def __eq__(self, other):
return mycmp(self.obj, other.obj) == 0
def __le__(self, other):
return mycmp(self.obj, other.obj) <= 0
def __ge__(self, other):
return mycmp(self.obj, other.obj) >= 0
def __ne__(self, other):
return mycmp(self.obj, other.obj) != 0
return K
To convert to a key function, just wrap the old comparison function:
.. testsetup::
from functools import cmp_to_key
.. doctest::
>>> sorted([5, 2, 4, 1, 3], key=cmp_to_key(reverse_numeric))
[5, 4, 3, 2, 1]
In Python 2.7, the :func:`functools.cmp_to_key` function was added to the
functools module.
Odd and Ends
============
* For locale aware sorting, use :func:`locale.strxfrm` for a key function or
:func:`locale.strcoll` for a comparison function.
* The *reverse* parameter still maintains sort stability (so that records with
equal keys retain their original order). Interestingly, that effect can be
simulated without the parameter by using the builtin :func:`reversed` function
twice:
>>> data = [('red', 1), ('blue', 1), ('red', 2), ('blue', 2)]
>>> standard_way = sorted(data, key=itemgetter(0), reverse=True)
>>> double_reversed = list(reversed(sorted(reversed(data), key=itemgetter(0))))
>>> assert standard_way == double_reversed
>>> standard_way
[('red', 1), ('red', 2), ('blue', 1), ('blue', 2)]
* To create a standard sort order for a class, just add the appropriate rich
comparison methods:
>>> Student.__eq__ = lambda self, other: self.age == other.age
>>> Student.__ne__ = lambda self, other: self.age != other.age
>>> Student.__lt__ = lambda self, other: self.age < other.age
>>> Student.__le__ = lambda self, other: self.age <= other.age
>>> Student.__gt__ = lambda self, other: self.age > other.age
>>> Student.__ge__ = lambda self, other: self.age >= other.age
>>> sorted(student_objects)
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
For general purpose comparisons, the recommended approach is to define all six
rich comparison operators. The :func:`functools.total_ordering` class
decorator makes this easy to implement.
* Key functions need not depend directly on the objects being sorted. A key
function can also access external resources. For instance, if the student grades
are stored in a dictionary, they can be used to sort a separate list of student
names:
>>> students = ['dave', 'john', 'jane']
>>> grades = {'john': 'F', 'jane':'A', 'dave': 'C'}
>>> sorted(students, key=grades.__getitem__)
['jane', 'dave', 'john']

748
Doc/howto/unicode.rst Normal file
View File

@@ -0,0 +1,748 @@
*****************
Unicode HOWTO
*****************
:Release: 1.03
This HOWTO discusses Python 2.x's support for Unicode, and explains
various problems that people commonly encounter when trying to work
with Unicode. For the Python 3 version, see
<https://docs.python.org/3/howto/unicode.html>.
Introduction to Unicode
=======================
History of Character Codes
--------------------------
In 1968, the American Standard Code for Information Interchange, better known by
its acronym ASCII, was standardized. ASCII defined numeric codes for various
characters, with the numeric values running from 0 to
127. For example, the lowercase letter 'a' is assigned 97 as its code
value.
ASCII was an American-developed standard, so it only defined unaccented
characters. There was an 'e', but no 'é' or 'Í'. This meant that languages
which required accented characters couldn't be faithfully represented in ASCII.
(Actually the missing accents matter for English, too, which contains words such
as 'naïve' and 'café', and some publications have house styles which require
spellings such as 'coöperate'.)
For a while people just wrote programs that didn't display accents. I remember
looking at Apple ][ BASIC programs, published in French-language publications in
the mid-1980s, that had lines like these::
PRINT "MISE A JOUR TERMINEE"
PRINT "PARAMETRES ENREGISTRES"
Those messages should contain accents, and they just look wrong to someone who
can read French.
In the 1980s, almost all personal computers were 8-bit, meaning that bytes could
hold values ranging from 0 to 255. ASCII codes only went up to 127, so some
machines assigned values between 128 and 255 to accented characters. Different
machines had different codes, however, which led to problems exchanging files.
Eventually various commonly used sets of values for the 128--255 range emerged.
Some were true standards, defined by the International Organization for
Standardization, and some were *de facto* conventions that were invented by one
company or another and managed to catch on.
255 characters aren't very many. For example, you can't fit both the accented
characters used in Western Europe and the Cyrillic alphabet used for Russian
into the 128--255 range because there are more than 128 such characters.
You could write files using different codes (all your Russian files in a coding
system called KOI8, all your French files in a different coding system called
Latin1), but what if you wanted to write a French document that quotes some
Russian text? In the 1980s people began to want to solve this problem, and the
Unicode standardization effort began.
Unicode started out using 16-bit characters instead of 8-bit characters. 16
bits means you have 2^16 = 65,536 distinct values available, making it possible
to represent many different characters from many different alphabets; an initial
goal was to have Unicode contain the alphabets for every single human language.
It turns out that even 16 bits isn't enough to meet that goal, and the modern
Unicode specification uses a wider range of codes, 0--1,114,111 (0x10ffff in
base-16).
There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
originally separate efforts, but the specifications were merged with the 1.1
revision of Unicode.
(This discussion of Unicode's history is highly simplified. I don't think the
average Python programmer needs to worry about the historical details; consult
the Unicode consortium site listed in the References for more information.)
Definitions
-----------
A **character** is the smallest possible component of a text. 'A', 'B', 'C',
etc., are all different characters. So are 'È' and 'Í'. Characters are
abstractions, and vary depending on the language or context you're talking
about. For example, the symbol for ohms (Ω) is usually drawn much like the
capital letter omega (Ω) in the Greek alphabet (they may even be the same in
some fonts), but these are two different characters that have different
meanings.
The Unicode standard describes how characters are represented by **code
points**. A code point is an integer value, usually denoted in base 16. In the
standard, a code point is written using the notation U+12ca to mean the
character with value 0x12ca (4810 decimal). The Unicode standard contains a lot
of tables listing characters and their corresponding code points::
0061 'a'; LATIN SMALL LETTER A
0062 'b'; LATIN SMALL LETTER B
0063 'c'; LATIN SMALL LETTER C
...
007B '{'; LEFT CURLY BRACKET
Strictly, these definitions imply that it's meaningless to say 'this is
character U+12ca'. U+12ca is a code point, which represents some particular
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. In
informal contexts, this distinction between code points and characters will
sometimes be forgotten.
A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**. The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used. Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.
Encodings
---------
To summarize the previous section: a Unicode string is a sequence of code
points, which are numbers from 0 to 0x10ffff. This sequence needs to be
represented as a set of bytes (meaning, values from 0--255) in memory. The rules
for translating a Unicode string into a sequence of bytes are called an
**encoding**.
The first encoding you might think of is an array of 32-bit integers. In this
representation, the string "Python" would look like this::
P y t h o n
0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
This representation is straightforward but using it presents a number of
problems.
1. It's not portable; different processors order the bytes differently.
2. It's very wasteful of space. In most texts, the majority of the code points
are less than 127, or less than 255, so a lot of space is occupied by zero
bytes. The above string takes 24 bytes compared to the 6 bytes needed for an
ASCII representation. Increased RAM usage doesn't matter too much (desktop
computers have megabytes of RAM, and strings aren't usually that large), but
expanding our usage of disk and network bandwidth by a factor of 4 is
intolerable.
3. It's not compatible with existing C functions such as ``strlen()``, so a new
family of wide string functions would need to be used.
4. Many Internet standards are defined in terms of textual data, and can't
handle content with embedded zero bytes.
Generally people don't use this encoding, instead choosing other
encodings that are more efficient and convenient. UTF-8 is probably
the most commonly supported encoding; it will be discussed below.
Encodings don't have to handle every possible Unicode character, and most
encodings don't. For example, Python's default encoding is the 'ascii'
encoding. The rules for converting a Unicode string into the ASCII encoding are
simple; for each code point:
1. If the code point is < 128, each byte is the same as the value of the code
point.
2. If the code point is 128 or greater, the Unicode string can't be represented
in this encoding. (Python raises a :exc:`UnicodeEncodeError` exception in this
case.)
Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points
0--255 are identical to the Latin-1 values, so converting to this encoding simply
requires converting code points to byte values; if a code point larger than 255
is encountered, the string can't be encoded into Latin-1.
Encodings don't have to be simple one-to-one mappings like Latin-1. Consider
IBM's EBCDIC, which was used on IBM mainframes. Letter values weren't in one
block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145
through 153. If you wanted to use EBCDIC as an encoding, you'd probably use
some sort of lookup table to perform the conversion, but this is largely an
internal detail.
UTF-8 is one of the most commonly used encodings. UTF stands for "Unicode
Transformation Format", and the '8' means that 8-bit numbers are used in the
encoding. (There's also a UTF-16 encoding, but it's less frequently used than
UTF-8.) UTF-8 uses the following rules:
1. If the code point is <128, it's represented by the corresponding byte value.
2. If the code point is between 128 and 0x7ff, it's turned into two byte values
between 128 and 255.
3. Code points >0x7ff are turned into three- or four-byte sequences, where each
byte of the sequence is between 128 and 255.
UTF-8 has several convenient properties:
1. It can handle any Unicode code point.
2. A Unicode string is turned into a string of bytes containing no embedded zero
bytes. This avoids byte-ordering issues, and means UTF-8 strings can be
processed by C functions such as ``strcpy()`` and sent through protocols that
can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of code points are turned into two
bytes, and values less than 128 occupy only a single byte.
5. If bytes are corrupted or lost, it's possible to determine the start of the
next UTF-8-encoded code point and resynchronize. It's also unlikely that
random 8-bit data will look like valid UTF-8.
References
----------
The Unicode Consortium site at <http://www.unicode.org> has character charts, a
glossary, and PDF versions of the Unicode specification. Be prepared for some
difficult reading. <http://www.unicode.org/history/> is a chronology of the
origin and development of Unicode.
To help understand the standard, Jukka Korpela has written an introductory guide
to reading the Unicode character tables, available at
<https://www.cs.tut.fi/~jkorpela/unicode/guide.html>.
Another good introductory article was written by Joel Spolsky
<http://www.joelonsoftware.com/articles/Unicode.html>.
If this introduction didn't make things clear to you, you should try reading this
alternate article before continuing.
.. Jason Orendorff XXX http://www.jorendorff.com/articles/unicode/ is broken
Wikipedia entries are often helpful; see the entries for "character encoding"
<http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
<http://en.wikipedia.org/wiki/UTF-8>, for example.
Python 2.x's Unicode Support
============================
Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.
The Unicode Type
----------------
Unicode strings are expressed as instances of the :class:`unicode` type, one of
Python's repertoire of built-in types. It derives from an abstract type called
:class:`basestring`, which is also an ancestor of the :class:`str` type; you can
therefore check if a value is a string type with ``isinstance(value,
basestring)``. Under the hood, Python represents Unicode strings as either 16-
or 32-bit integers, depending on how the Python interpreter was compiled.
The :func:`unicode` constructor has the signature ``unicode(string[, encoding,
errors])``. All of its arguments should be 8-bit strings. The first argument
is converted to Unicode using the specified encoding; if you leave off the
``encoding`` argument, the ASCII encoding is used for the conversion, so
characters greater than 127 will be treated as errors::
>>> unicode('abcdef')
u'abcdef'
>>> s = unicode('abcdef')
>>> type(s)
<type 'unicode'>
>>> unicode('abcdef' + chr(255)) #doctest: +NORMALIZE_WHITESPACE
Traceback (most recent call last):
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
ordinal not in range(128)
The ``errors`` argument specifies the response when the input string can't be
converted according to the encoding's rules. Legal values for this argument are
'strict' (raise a ``UnicodeDecodeError`` exception), 'replace' (add U+FFFD,
'REPLACEMENT CHARACTER'), or 'ignore' (just leave the character out of the
Unicode result). The following examples show the differences::
>>> unicode('\x80abc', errors='strict') #doctest: +NORMALIZE_WHITESPACE
Traceback (most recent call last):
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
ordinal not in range(128)
>>> unicode('\x80abc', errors='replace')
u'\ufffdabc'
>>> unicode('\x80abc', errors='ignore')
u'abc'
Encodings are specified as strings containing the encoding's name. Python 2.7
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list. Some encodings
have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all
synonyms for the same encoding.
One-character Unicode strings can also be created with the :func:`unichr`
built-in function, which takes integers and returns a Unicode string of length 1
that contains the corresponding code point. The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::
>>> unichr(40960)
u'\ua000'
>>> ord(u'\ua000')
40960
Instances of the :class:`unicode` type have many of the same methods as the
8-bit string type for operations such as searching and formatting::
>>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
>>> s.count('e')
5
>>> s.find('feather')
9
>>> s.find('bird')
-1
>>> s.replace('feather', 'sand')
u'Was ever sand so lightly blown to and fro as this multitude?'
>>> s.upper()
u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'
Note that the arguments to these methods can be Unicode strings or 8-bit
strings. 8-bit strings will be converted to Unicode before carrying out the
operation; Python's default ASCII encoding will be used, so characters greater
than 127 will cause an exception::
>>> s.find('Was\x9f') #doctest: +NORMALIZE_WHITESPACE
Traceback (most recent call last):
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3:
ordinal not in range(128)
>>> s.find(u'Was\x9f')
-1
Much Python code that operates on strings will therefore work with Unicode
strings without requiring any changes to the code. (Input and output code needs
more updating for Unicode; more on this later.)
Another important method is ``.encode([encoding], [errors='strict'])``, which
returns an 8-bit string version of the Unicode string, encoded in the requested
encoding. The ``errors`` parameter is the same as the parameter of the
``unicode()`` constructor, with one additional possibility; as well as 'strict',
'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which uses XML's
character references. The following example shows the different results::
>>> u = unichr(40960) + u'abcd' + unichr(1972)
>>> u.encode('utf-8')
'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii') #doctest: +NORMALIZE_WHITESPACE
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in
position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
'abcd'
>>> u.encode('ascii', 'replace')
'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
'&#40960;abcd&#1972;'
Python's 8-bit strings have a ``.decode([encoding], [errors])`` method that
interprets the string using the given encoding::
>>> u = unichr(40960) + u'abcd' + unichr(1972) # Assemble a string
>>> utf8_version = u.encode('utf-8') # Encode as UTF-8
>>> type(utf8_version), utf8_version
(<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
>>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8
>>> u == u2 # The two strings match
True
The low-level routines for registering and accessing the available encodings are
found in the :mod:`codecs` module. However, the encoding and decoding functions
returned by this module are usually more low-level than is comfortable, so I'm
not going to describe the :mod:`codecs` module here. If you need to implement a
completely new encoding, you'll need to learn about the :mod:`codecs` module
interfaces, but implementing encodings is a specialized task that also won't be
covered here. Consult the Python documentation to learn more about this module.
The most commonly used part of the :mod:`codecs` module is the
:func:`codecs.open` function which will be discussed in the section on input and
output.
Unicode Literals in Python Source Code
--------------------------------------
In Python source code, Unicode literals are written as strings prefixed with the
'u' or 'U' character: ``u'abcdefghijk'``. Specific code points can be written
using the ``\u`` escape sequence, which is followed by four hex digits giving
the code point. The ``\U`` escape sequence is similar, but expects 8 hex
digits, not 4.
Unicode literals can also use the same escape sequences as 8-bit strings,
including ``\x``, but ``\x`` only takes two hex digits so it can't express an
arbitrary code point. Octal escapes can go up to U+01ff, which is octal 777.
::
>>> s = u"a\xac\u1234\u20ac\U00008000"
... # ^^^^ two-digit hex escape
... # ^^^^^^ four-digit Unicode escape
... # ^^^^^^^^^^ eight-digit Unicode escape
>>> for c in s: print ord(c),
...
97 172 4660 8364 32768
Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language. You
can also assemble strings using the :func:`unichr` built-in function, but this is
even more tedious.
Ideally, you'd want to be able to write literals in your language's natural
encoding. You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.
Python supports writing Unicode literals in any encoding, but you have to
declare the encoding being used. This is done by including a special comment as
either the first or second line of the source file::
#!/usr/bin/env python
# -*- coding: latin-1 -*-
u = u'abcdé'
print ord(u[-1])
The syntax is inspired by Emacs's notation for specifying variables local to a
file. Emacs supports many different variables, but Python only supports
'coding'. The ``-*-`` symbols indicate to Emacs that the comment is special;
they have no significance to Python but are a convention. Python looks for
``coding: name`` or ``coding=name`` in the comment.
If you don't include such a comment, the default encoding used will be ASCII.
Versions of Python before 2.4 were Euro-centric and assumed Latin-1 as a default
encoding for string literals; in Python 2.4, characters greater than 127 still
work but result in a warning. For example, the following program has no
encoding declaration::
#!/usr/bin/env python
u = u'abcdé'
print ord(u[-1])
When you run it with Python 2.4, it will output the following warning::
amk:~$ python2.4 p263.py
sys:1: DeprecationWarning: Non-ASCII character '\xe9'
in file p263.py on line 2, but no encoding declared;
see https://www.python.org/peps/pep-0263.html for details
Python 2.5 and higher are stricter and will produce a syntax error::
amk:~$ python2.5 p263.py
File "/tmp/p263.py", line 2
SyntaxError: Non-ASCII character '\xc3' in file /tmp/p263.py
on line 2, but no encoding declared; see
https://www.python.org/peps/pep-0263.html for details
Unicode Properties
------------------
The Unicode specification includes a database of information about code points.
For each code point that's defined, the information includes the character's
name, its category, the numeric value if applicable (Unicode has characters
representing the Roman numerals and fractions such as one-third and
four-fifths). There are also properties related to the code point's use in
bidirectional text and other display-related properties.
The following program displays some information about several characters, and
prints the numeric value of one particular character::
import unicodedata
u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)
for i, c in enumerate(u):
print i, '%04x' % ord(c), unicodedata.category(c),
print unicodedata.name(c)
# Get numeric value of second character
print unicodedata.numeric(u[1])
When run, this prints::
0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
1 0bf2 No TAMIL NUMBER ONE THOUSAND
2 0f84 Mn TIBETAN MARK HALANTA
3 1770 Lo TAGBANWA LETTER SA
4 33af So SQUARE RAD OVER S SQUARED
1000.0
The category codes are abbreviations describing the nature of the character.
These are grouped into categories such as "Letter", "Number", "Punctuation", or
"Symbol", which in turn are broken up into subcategories. To take the codes
from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
other". See
<http://www.unicode.org/reports/tr44/#General_Category_Values> for a
list of category codes.
References
----------
The Unicode and 8-bit string types are described in the Python library reference
at :ref:`typesseq`.
The documentation for the :mod:`unicodedata` module.
The documentation for the :mod:`codecs` module.
Marc-André Lemburg gave a presentation at EuroPython 2002 titled "Python and
Unicode". A PDF version of his slides is available at
<https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>, and is an
excellent overview of the design of Python's Unicode features.
Reading and Writing Unicode Data
================================
Once you've written some code that works with Unicode data, the next problem is
input/output. How do you get Unicode strings into your program, and how do you
convert Unicode into a form suitable for storage or transmission?
It's possible that you may not need to do anything depending on your input
sources and output destinations; you should check whether the libraries used in
your application support Unicode natively. XML parsers often return Unicode
data, for example. Many relational databases also support Unicode-valued
columns and can return Unicode values from an SQL query.
Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket. It's possible to do all the work
yourself: open a file, read an 8-bit string from it, and convert the string with
``unicode(str, encoding)``. However, the manual approach is not recommended.
One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes. If you want to read the file in arbitrary-sized
chunks (say, 1K or 4K), you need to write error-handling code to catch the case
where only part of the bytes encoding a single Unicode character are read at the
end of a chunk. One solution would be to read the entire file into memory and
then perform the decoding, but that prevents you from working with files that
are extremely large; if you need to read a 2Gb file, you need 2Gb of RAM.
(More, really, since for at least a moment you'd need to have both the encoded
string and its Unicode version in memory.)
The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences. The work of implementing this has already been
done for you: the :mod:`codecs` module includes a version of the :func:`open`
function that returns a file-like object that assumes the file's contents are in
a specified encoding and accepts Unicode parameters for methods such as
``.read()`` and ``.write()``.
The function's parameters are ``open(filename, mode='rb', encoding=None,
errors='strict', buffering=1)``. ``mode`` can be ``'r'``, ``'w'``, or ``'a'``,
just like the corresponding parameter to the regular built-in ``open()``
function; add a ``'+'`` to update the file. ``buffering`` is similarly parallel
to the standard function's parameter. ``encoding`` is a string giving the
encoding to use; if it's left as ``None``, a regular Python file object that
accepts 8-bit strings is returned. Otherwise, a wrapper object is returned, and
data written to or read from the wrapper object will be converted as needed.
``errors`` specifies the action for encoding errors and can be one of the usual
values of 'strict', 'ignore', and 'replace'.
Reading Unicode from a file is therefore simple::
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
print repr(line)
It's also possible to open files in update mode, allowing both reading and
writing::
f = codecs.open('test', encoding='utf-8', mode='w+')
f.write(u'\u4500 blah blah blah\n')
f.seek(0)
print repr(f.readline()[:1])
f.close()
Unicode character U+FEFF is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection
of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be
automatically written as the first character and will be silently dropped when
the file is read. There are variants of these encodings, such as 'utf-16-le'
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
particular byte ordering and don't skip the BOM.
Unicode filenames
-----------------
Most of the operating systems in common use today support filenames that contain
arbitrary Unicode characters. Usually this is implemented by converting the
Unicode string into some encoding that varies depending on the system. For
example, Mac OS X uses UTF-8 while Windows uses a configurable encoding; on
Windows, Python uses the name "mbcs" to refer to whatever the currently
configured encoding is. On Unix systems, there will only be a filesystem
encoding if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
you haven't, the default encoding is ASCII.
The :func:`sys.getfilesystemencoding` function returns the encoding to use on
your current system, in case you want to do the encoding manually, but there's
not much reason to bother. When opening a file for reading or writing, you can
usually just provide the Unicode string as the filename, and it will be
automatically converted to the right encoding for you::
filename = u'filename\u4500abc'
f = open(filename, 'w')
f.write('blah\n')
f.close()
Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
filenames.
:func:`os.listdir`, which returns filenames, raises an issue: should it return
the Unicode version of filenames, or should it return 8-bit strings containing
the encoded versions? :func:`os.listdir` will do both, depending on whether you
provided the directory path as an 8-bit string or a Unicode string. If you pass
a Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing an 8-bit
path will return the 8-bit versions of the filenames. For example, assuming the
default filesystem encoding is UTF-8, running the following program::
fn = u'filename\u4500abc'
f = open(fn, 'w')
f.close()
import os
print os.listdir('.')
print os.listdir(u'.')
will produce the following output:
.. code-block:: shell-session
amk:~$ python t.py
['.svn', 'filename\xe4\x94\x80abc', ...]
[u'.svn', u'filename\u4500abc', ...]
The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.
Tips for Writing Unicode-aware Programs
---------------------------------------
This section provides some suggestions on writing software that deals with
Unicode.
The most important tip is:
Software should only work with Unicode strings internally, converting to a
particular encoding on output.
If you attempt to write processing functions that accept both Unicode and 8-bit
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings. Python's default encoding is ASCII, so whenever
a character with an ASCII value > 127 is in the input data, you'll get a
:exc:`UnicodeDecodeError` because that character can't be handled by the ASCII
encoding.
It's easy to miss such problems if you only test your software with data that
doesn't contain any accents; everything will seem to work, but there's actually
a bug in your program waiting for the first user who attempts to use characters
> 127. A second tip, therefore, is:
Include characters > 127 and, even better, characters > 255 in your test
data.
When using data coming from a web browser or some other untrusted source, a
common technique is to check for illegal characters in a string before using the
string in a generated command line or storing it in a database. If you're doing
this, be careful to check the string once it's in the form that will be used or
stored; it's possible for encodings to be used to disguise characters. This is
especially true if the input data also specifies the encoding; many encodings
leave the commonly checked-for characters alone, but Python includes some
encodings such as ``'base64'`` that modify every single character.
For example, let's say you have a content management system that takes a Unicode
filename, and you want to disallow paths with a '/' character. You might write
this code::
def read_file (filename, encoding):
if '/' in filename:
raise ValueError("'/' not allowed in filenames")
unicode_name = filename.decode(encoding)
f = open(unicode_name, 'r')
# ... return contents of file ...
However, if an attacker could specify the ``'base64'`` encoding, they could pass
``'L2V0Yy9wYXNzd2Q='``, which is the base-64 encoded form of the string
``'/etc/passwd'``, to read a system file. The above code looks for ``'/'``
characters in the encoded form and misses the dangerous character in the
resulting decoded form.
References
----------
The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
Applications in Python" are available at
<https://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
and discuss questions of character encodings as well as how to internationalize
and localize an application.
Revision History and Acknowledgements
=====================================
Thanks to the following people who have noted errors or offered suggestions on
this article: Nicholas Bastin, Marius Gedminas, Kent Johnson, Ken Krugler,
Marc-André Lemburg, Martin von Löwis, Chad Whitacre.
Version 1.0: posted August 5 2005.
Version 1.01: posted August 7 2005. Corrects factual and markup errors; adds
several links.
Version 1.02: posted August 16 2005. Corrects factual errors.
Version 1.03: posted June 20 2010. Notes that Python 3.x is not covered,
and that the HOWTO only covers 2.x.
.. comment Describe Python 3.x support (new section? new document?)
.. comment Additional topic: building Python w/ UCS2 or UCS4 support
.. comment Describe obscure -U switch somewhere?
.. comment Describe use of codecs.StreamRecoder and StreamReaderWriter
.. comment
Original outline:
- [ ] Unicode introduction
- [ ] ASCII
- [ ] Terms
- [ ] Character
- [ ] Code point
- [ ] Encodings
- [ ] Common encodings: ASCII, Latin-1, UTF-8
- [ ] Unicode Python type
- [ ] Writing unicode literals
- [ ] Obscurity: -U switch
- [ ] Built-ins
- [ ] unichr()
- [ ] ord()
- [ ] unicode() constructor
- [ ] Unicode type
- [ ] encode(), decode() methods
- [ ] Unicodedata module for character properties
- [ ] I/O
- [ ] Reading/writing Unicode data into files
- [ ] Byte-order marks
- [ ] Unicode filenames
- [ ] Writing Unicode programs
- [ ] Do everything in Unicode
- [ ] Declaring source code encodings (PEP 263)
- [ ] Other issues
- [ ] Building Python (UCS2, UCS4)

584
Doc/howto/urllib2.rst Normal file
View File

@@ -0,0 +1,584 @@
.. _urllib-howto:
************************************************
HOWTO Fetch Internet Resources Using urllib2
************************************************
:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_
.. note::
There is a French translation of an earlier revision of this
HOWTO, available at `urllib2 - Le Manuel manquant
<http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.
Introduction
============
.. sidebar:: Related Articles
You may also find useful the following article on fetching web resources
with Python:
* `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_
A tutorial on *Basic Authentication*, with examples in Python.
**urllib2** is a Python module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.
urllib2 supports fetching URLs for many "URL schemes" (identified by the string
before the ``":"`` in URL - for example ``"ftp"`` is the URL scheme of
``"ftp://python.org/"``) using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.
For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib2*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib2` docs, but is supplementary to them.
Fetching URLs
=============
The simplest way to use urllib2 is as follows::
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.
HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib2 mirrors this with a ``Request`` object which represents
the HTTP request you are making. In its simplest form you create a Request
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
Request object returns a response object for the URL requested. This response is
a file-like object, which means you can for example call ``.read()`` on the
response::
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
Note that urllib2 makes use of the same Request interface to handle all URL
schemes. For example, you can make an FTP request like so::
req = urllib2.Request('ftp://example.com/')
In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server. Second, you can pass
extra information ("metadata") *about* the data or the about request itself, to
the server - this information is sent as HTTP "headers". Let's look at each of
these in turn.
Data
----
Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script [#]_ or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often what
your browser does when you submit a HTML form that you filled in on the web. Not
all POSTs have to come from forms: you can use a POST to transmit arbitrary data
to your own application. In the common case of HTML forms, the data needs to be
encoded in a standard way, and then passed to the Request object as the ``data``
argument. The encoding is done using a function from the ``urllib`` library
*not* from ``urllib2``. ::
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord',
'location' : 'Northampton',
'language' : 'Python' }
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<https://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).
If you do not pass the ``data`` argument, urllib2 uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door). Though the HTTP standard makes it clear that POSTs are
intended to *always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST requests from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.
This is done as follows::
>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values # The order may differ. #doctest: +SKIP
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)
Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.
Headers
-------
We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.
Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_. By default urllib2 identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
values = {'name': 'Michael Foord',
'location': 'Northampton',
'language': 'Python' }
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
The response also has two useful methods. See the section on `info and geturl`_
which comes after we have a look at what happens when things go wrong.
Handling Exceptions
===================
*urlopen* raises :exc:`URLError` when it cannot handle a response (though as
usual with Python APIs, built-in exceptions such as :exc:`ValueError`,
:exc:`TypeError` etc. may also be raised).
:exc:`HTTPError` is the subclass of :exc:`URLError` raised in the specific case of
HTTP URLs.
URLError
--------
Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist. In this case, the
exception raised will have a 'reason' attribute, which is a tuple containing an
error code and a text error message.
e.g. ::
>>> req = urllib2.Request('http://www.pretend_server.org')
>>> try: urllib2.urlopen(req)
... except urllib2.URLError as e:
... print e.reason #doctest: +SKIP
...
(4, 'getaddrinfo failed')
HTTPError
---------
Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib2 will handle that for you). For those it can't handle,
urlopen will raise an :exc:`HTTPError`. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).
See section 10 of RFC 2616 for a reference on all the HTTP error codes.
The :exc:`HTTPError` instance raised will have an integer 'code' attribute, which
corresponds to the error sent by the server.
Error Codes
~~~~~~~~~~~
Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100--299 range indicate success, you will usually only see error
codes in the 400--599 range.
``BaseHTTPServer.BaseHTTPRequestHandler.responses`` is a useful dictionary of
response codes in that shows all the response codes used by RFC 2616. The
dictionary is reproduced here for convenience ::
# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
100: ('Continue', 'Request received, please continue'),
101: ('Switching Protocols',
'Switching to new protocol; obey Upgrade header'),
200: ('OK', 'Request fulfilled, document follows'),
201: ('Created', 'Document created, URL follows'),
202: ('Accepted',
'Request accepted, processing continues off-line'),
203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
204: ('No Content', 'Request fulfilled, nothing follows'),
205: ('Reset Content', 'Clear input form for further input.'),
206: ('Partial Content', 'Partial content follows.'),
300: ('Multiple Choices',
'Object has several resources -- see URI list'),
301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
302: ('Found', 'Object moved temporarily -- see URI list'),
303: ('See Other', 'Object moved -- see Method and URL list'),
304: ('Not Modified',
'Document has not changed since given time'),
305: ('Use Proxy',
'You must use proxy specified in Location to access this '
'resource.'),
307: ('Temporary Redirect',
'Object moved temporarily -- see URI list'),
400: ('Bad Request',
'Bad request syntax or unsupported method'),
401: ('Unauthorized',
'No permission -- see authorization schemes'),
402: ('Payment Required',
'No payment -- see charging schemes'),
403: ('Forbidden',
'Request forbidden -- authorization will not help'),
404: ('Not Found', 'Nothing matches the given URI'),
405: ('Method Not Allowed',
'Specified method is invalid for this server.'),
406: ('Not Acceptable', 'URI not available in preferred format.'),
407: ('Proxy Authentication Required', 'You must authenticate with '
'this proxy before proceeding.'),
408: ('Request Timeout', 'Request timed out; try again later.'),
409: ('Conflict', 'Request conflict.'),
410: ('Gone',
'URI no longer exists and has been permanently removed.'),
411: ('Length Required', 'Client must specify Content-Length.'),
412: ('Precondition Failed', 'Precondition in headers is false.'),
413: ('Request Entity Too Large', 'Entity is too large.'),
414: ('Request-URI Too Long', 'URI is too long.'),
415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
416: ('Requested Range Not Satisfiable',
'Cannot satisfy request range.'),
417: ('Expectation Failed',
'Expect condition could not be satisfied.'),
500: ('Internal Server Error', 'Server got itself in trouble'),
501: ('Not Implemented',
'Server does not support this operation'),
502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
503: ('Service Unavailable',
'The server cannot process the request due to a high load'),
504: ('Gateway Timeout',
'The gateway server did not receive a timely response'),
505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
}
When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the :exc:`HTTPError` instance as a response on the
page returned. This means that as well as the code attribute, it also has read,
geturl, and info, methods. ::
>>> req = urllib2.Request('http://www.python.org/fish.html')
>>> try:
... urllib2.urlopen(req)
... except urllib2.HTTPError as e:
... print e.code
... print e.read() #doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
...
404
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
...
<title>Page Not Found</title>
...
Wrapping it Up
--------------
So if you want to be prepared for :exc:`HTTPError` *or* :exc:`URLError` there are two
basic approaches. I prefer the second approach.
Number 1
~~~~~~~~
::
from urllib2 import Request, urlopen, URLError, HTTPError
req = Request(someurl)
try:
response = urlopen(req)
except HTTPError as e:
print 'The server couldn\'t fulfill the request.'
print 'Error code: ', e.code
except URLError as e:
print 'We failed to reach a server.'
print 'Reason: ', e.reason
else:
# everything is fine
.. note::
The ``except HTTPError`` *must* come first, otherwise ``except URLError``
will *also* catch an :exc:`HTTPError`.
Number 2
~~~~~~~~
::
from urllib2 import Request, urlopen, URLError
req = Request(someurl)
try:
response = urlopen(req)
except URLError as e:
if hasattr(e, 'reason'):
print 'We failed to reach a server.'
print 'Reason: ', e.reason
elif hasattr(e, 'code'):
print 'The server couldn\'t fulfill the request.'
print 'Error code: ', e.code
else:
# everything is fine
info and geturl
===============
The response returned by urlopen (or the :exc:`HTTPError` instance) has two useful
methods :meth:`info` and :meth:`geturl`.
**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.
**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
``httplib.HTTPMessage`` instance.
Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <https://www.cs.tut.fi/~jkorpela/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.
Openers and Handlers
====================
When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named :class:`urllib2.OpenerDirector`). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.
You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.
To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.
Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call. ``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.
Other sorts of handlers you might want to can handle proxies, authentication,
and other common but slightly specialised situations.
``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.
Opener objects have an ``open`` method, which can be called directly to fetch
urls in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.
Basic Authentication
====================
To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject --
including an explanation of how Basic Authentication works - see the `Basic
Authentication Tutorial
<http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.
When authentication is required, the server sends a header (as well as the 401
error code) requesting authentication. This specifies the authentication scheme
and a 'realm'. The header looks like: ``WWW-Authenticate: SCHEME
realm="REALM"``.
e.g. ::
WWW-Authenticate: Basic realm="cPanel Users"
The client should then retry the request with the appropriate name and password
for the realm included as a header in the request. This is 'basic
authentication'. In order to simplify this process we can create an instance of
``HTTPBasicAuthHandler`` and an opener to use this handler.
The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
the mapping of URLs and realms to passwords and usernames. If you know what the
realm is (from the authentication header sent by the server), then you can use a
``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
you to specify a default username and password for a URL. This will be supplied
in the absence of you providing an alternative combination for a specific
realm. We indicate this by providing ``None`` as the realm argument to the
``add_password`` method.
The top-level URL is the first URL that requires authentication. URLs "deeper"
than the URL you pass to .add_password() will also match. ::
# create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "http://example.com/foo/"
password_mgr.add_password(None, top_level_url, username, password)
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
# create "opener" (OpenerDirector instance)
opener = urllib2.build_opener(handler)
# use the opener to fetch a URL
opener.open(a_url)
# Install the opener.
# Now all calls to urllib2.urlopen use our opener.
urllib2.install_opener(opener)
.. note::
In the above example we only supplied our ``HTTPBasicAuthHandler`` to
``build_opener``. By default openers have the handlers for normal situations
-- ``ProxyHandler`` (if a proxy setting such as an :envvar:`http_proxy`
environment variable is set), ``UnknownHandler``, ``HTTPHandler``,
``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
``FileHandler``, ``HTTPErrorProcessor``.
``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number)
e.g. ``"http://example.com/"`` *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. ``"example.com"`` or ``"example.com:8080"``
(the latter example includes a port number). The authority, if present, must
NOT contain the "userinfo" component - for example ``"joe:password@example.com"`` is
not correct.
Proxies
=======
**urllib2** will auto-detect your proxy settings and use those. This is through
the ``ProxyHandler``, which is part of the normal handler chain when a proxy
setting is detected. Normally that's a good thing, but there are occasions
when it may not be helpful [#]_. One way to do this is to setup our own
``ProxyHandler``, with no proxies defined. This is done using similar steps to
setting up a `Basic Authentication`_ handler: ::
>>> proxy_support = urllib2.ProxyHandler({})
>>> opener = urllib2.build_opener(proxy_support)
>>> urllib2.install_opener(opener)
.. note::
Currently ``urllib2`` *does not* support fetching of ``https`` locations
through a proxy. However, this can be enabled by extending urllib2 as
shown in the recipe [#]_.
.. note::
``HTTP_PROXY`` will be ignored if a variable ``REQUEST_METHOD`` is set; see
the documentation on :func:`~urllib.getproxies`.
Sockets and Layers
==================
The Python support for fetching resources from the web is layered. urllib2 uses
the httplib library, which in turn uses the socket library.
As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang. Currently,
the socket timeout is not exposed at the httplib or urllib2 levels. However,
you can set the default timeout globally for all sockets using ::
import socket
import urllib2
# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)
# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
-------
Footnotes
=========
This document was reviewed and revised by John Lee.
.. [#] For an introduction to the CGI protocol see
`Writing Web Applications in Python <http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html>`_.
.. [#] Google for example.
.. [#] Browser sniffing is a very bad practice for website design - building
sites using web standards is much more sensible. Unfortunately a lot of
sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
*'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
`Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
attempt to fetch *localhost* URLs through this proxy it blocks them. IE
is set to use the proxy, which urllib2 picks up on. In order to test
scripts with a localhost server, I have to prevent urllib2 from using
the proxy.
.. [#] urllib2 opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
<https://code.activestate.com/recipes/456195/>`_.

735
Doc/howto/webservers.rst Normal file
View File

@@ -0,0 +1,735 @@
*******************************
HOWTO Use Python in the web
*******************************
:Author: Marek Kubica
.. topic:: Abstract
This document shows how Python fits into the web. It presents some ways
to integrate Python with a web server, and general practices useful for
developing web sites.
Programming for the Web has become a hot topic since the rise of "Web 2.0",
which focuses on user-generated content on web sites. It has always been
possible to use Python for creating web sites, but it was a rather tedious task.
Therefore, many frameworks and helper tools have been created to assist
developers in creating faster and more robust sites. This HOWTO describes
some of the methods used to combine Python with a web server to create
dynamic content. It is not meant as a complete introduction, as this topic is
far too broad to be covered in one single document. However, a short overview
of the most popular libraries is provided.
.. seealso::
While this HOWTO tries to give an overview of Python in the web, it cannot
always be as up to date as desired. Web development in Python is rapidly
moving forward, so the wiki page on `Web Programming
<https://wiki.python.org/moin/WebProgramming>`_ may be more in sync with
recent development.
The Low-Level View
==================
When a user enters a web site, their browser makes a connection to the site's
web server (this is called the *request*). The server looks up the file in the
file system and sends it back to the user's browser, which displays it (this is
the *response*). This is roughly how the underlying protocol, HTTP, works.
Dynamic web sites are not based on files in the file system, but rather on
programs which are run by the web server when a request comes in, and which
*generate* the content that is returned to the user. They can do all sorts of
useful things, like display the postings of a bulletin board, show your email,
configure software, or just display the current time. These programs can be
written in any programming language the server supports. Since most servers
support Python, it is easy to use Python to create dynamic web sites.
Most HTTP servers are written in C or C++, so they cannot execute Python code
directly -- a bridge is needed between the server and the program. These
bridges, or rather interfaces, define how programs interact with the server.
There have been numerous attempts to create the best possible interface, but
there are only a few worth mentioning.
Not every web server supports every interface. Many web servers only support
old, now-obsolete interfaces; however, they can often be extended using
third-party modules to support newer ones.
Common Gateway Interface
------------------------
This interface, most commonly referred to as "CGI", is the oldest, and is
supported by nearly every web server out of the box. Programs using CGI to
communicate with their web server need to be started by the server for every
request. So, every request starts a new Python interpreter -- which takes some
time to start up -- thus making the whole interface only usable for low load
situations.
The upside of CGI is that it is simple -- writing a Python program which uses
CGI is a matter of about three lines of code. This simplicity comes at a
price: it does very few things to help the developer.
Writing CGI programs, while still possible, is no longer recommended. With
:ref:`WSGI <WSGI>`, a topic covered later in this document, it is possible to write
programs that emulate CGI, so they can be run as CGI if no better option is
available.
.. seealso::
The Python standard library includes some modules that are helpful for
creating plain CGI programs:
* :mod:`cgi` -- Handling of user input in CGI scripts
* :mod:`cgitb` -- Displays nice tracebacks when errors happen in CGI
applications, instead of presenting a "500 Internal Server Error" message
The Python wiki features a page on `CGI scripts
<https://wiki.python.org/moin/CgiScripts>`_ with some additional information
about CGI in Python.
Simple script for testing CGI
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To test whether your web server works with CGI, you can use this short and
simple CGI program::
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
# enable debugging
import cgitb
cgitb.enable()
print "Content-Type: text/plain;charset=utf-8"
print
print "Hello World!"
Depending on your web server configuration, you may need to save this code with
a ``.py`` or ``.cgi`` extension. Additionally, this file may also need to be
in a ``cgi-bin`` folder, for security reasons.
You might wonder what the ``cgitb`` line is about. This line makes it possible
to display a nice traceback instead of just crashing and displaying an "Internal
Server Error" in the user's browser. This is useful for debugging, but it might
risk exposing some confidential data to the user. You should not use ``cgitb``
in production code for this reason. You should *always* catch exceptions, and
display proper error pages -- end-users don't like to see nondescript "Internal
Server Errors" in their browsers.
Setting up CGI on your own server
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you don't have your own web server, this does not apply to you. You can
check whether it works as-is, and if not you will need to talk to the
administrator of your web server. If it is a big host, you can try filing a
ticket asking for Python support.
If you are your own administrator or want to set up CGI for testing purposes on
your own computers, you have to configure it by yourself. There is no single
way to configure CGI, as there are many web servers with different
configuration options. Currently the most widely used free web server is
`Apache HTTPd <http://httpd.apache.org/>`_, or Apache for short. Apache can be
easily installed on nearly every system using the system's package management
tool. `lighttpd <http://www.lighttpd.net>`_ is another alternative and is
said to have better performance. On many systems this server can also be
installed using the package management tool, so manually compiling the web
server may not be needed.
* On Apache you can take a look at the `Dynamic Content with CGI
<http://httpd.apache.org/docs/2.2/howto/cgi.html>`_ tutorial, where everything
is described. Most of the time it is enough just to set ``+ExecCGI``. The
tutorial also describes the most common gotchas that might arise.
* On lighttpd you need to use the `CGI module
<http://redmine.lighttpd.net/projects/lighttpd/wiki/Docs_ModCGI>`_\ , which can be configured
in a straightforward way. It boils down to setting ``cgi.assign`` properly.
Common problems with CGI scripts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Using CGI sometimes leads to small annoyances while trying to get these
scripts to run. Sometimes a seemingly correct script does not work as
expected, the cause being some small hidden problem that's difficult to spot.
Some of these potential problems are:
* The Python script is not marked as executable. When CGI scripts are not
executable most web servers will let the user download it, instead of
running it and sending the output to the user. For CGI scripts to run
properly on Unix-like operating systems, the ``+x`` bit needs to be set.
Using ``chmod a+x your_script.py`` may solve this problem.
* On a Unix-like system, The line endings in the program file must be Unix
style line endings. This is important because the web server checks the
first line of the script (called shebang) and tries to run the program
specified there. It gets easily confused by Windows line endings (Carriage
Return & Line Feed, also called CRLF), so you have to convert the file to
Unix line endings (only Line Feed, LF). This can be done automatically by
uploading the file via FTP in text mode instead of binary mode, but the
preferred way is just telling your editor to save the files with Unix line
endings. Most editors support this.
* Your web server must be able to read the file, and you need to make sure the
permissions are correct. On unix-like systems, the server often runs as user
and group ``www-data``, so it might be worth a try to change the file
ownership, or making the file world readable by using ``chmod a+r
your_script.py``.
* The web server must know that the file you're trying to access is a CGI script.
Check the configuration of your web server, as it may be configured
to expect a specific file extension for CGI scripts.
* On Unix-like systems, the path to the interpreter in the shebang
(``#!/usr/bin/env python``) must be correct. This line calls
``/usr/bin/env`` to find Python, but it will fail if there is no
``/usr/bin/env``, or if Python is not in the web server's path. If you know
where your Python is installed, you can also use that full path. The
commands ``whereis python`` and ``type -p python`` could help you find
where it is installed. Once you know the path, you can change the shebang
accordingly: ``#!/usr/bin/python``.
* The file must not contain a BOM (Byte Order Mark). The BOM is meant for
determining the byte order of UTF-16 and UTF-32 encodings, but some editors
write this also into UTF-8 files. The BOM interferes with the shebang line,
so be sure to tell your editor not to write the BOM.
* If the web server is using :ref:`mod-python`, ``mod_python`` may be having
problems. ``mod_python`` is able to handle CGI scripts by itself, but it can
also be a source of issues.
.. _mod-python:
mod_python
----------
People coming from PHP often find it hard to grasp how to use Python in the web.
Their first thought is mostly `mod_python <http://modpython.org/>`_\ ,
because they think that this is the equivalent to ``mod_php``. Actually, there
are many differences. What ``mod_python`` does is embed the interpreter into
the Apache process, thus speeding up requests by not having to start a Python
interpreter for each request. On the other hand, it is not "Python intermixed
with HTML" in the way that PHP is often intermixed with HTML. The Python
equivalent of that is a template engine. ``mod_python`` itself is much more
powerful and provides more access to Apache internals. It can emulate CGI,
work in a "Python Server Pages" mode (similar to JSP) which is "HTML
intermingled with Python", and it has a "Publisher" which designates one file
to accept all requests and decide what to do with them.
``mod_python`` does have some problems. Unlike the PHP interpreter, the Python
interpreter uses caching when executing files, so changes to a file will
require the web server to be restarted. Another problem is the basic concept
-- Apache starts child processes to handle the requests, and unfortunately
every child process needs to load the whole Python interpreter even if it does
not use it. This makes the whole web server slower. Another problem is that,
because ``mod_python`` is linked against a specific version of ``libpython``,
it is not possible to switch from an older version to a newer (e.g. 2.4 to 2.5)
without recompiling ``mod_python``. ``mod_python`` is also bound to the Apache
web server, so programs written for ``mod_python`` cannot easily run on other
web servers.
These are the reasons why ``mod_python`` should be avoided when writing new
programs. In some circumstances it still might be a good idea to use
``mod_python`` for deployment, but WSGI makes it possible to run WSGI programs
under ``mod_python`` as well.
FastCGI and SCGI
----------------
FastCGI and SCGI try to solve the performance problem of CGI in another way.
Instead of embedding the interpreter into the web server, they create
long-running background processes. There is still a module in the web server
which makes it possible for the web server to "speak" with the background
process. As the background process is independent of the server, it can be
written in any language, including Python. The language just needs to have a
library which handles the communication with the webserver.
The difference between FastCGI and SCGI is very small, as SCGI is essentially
just a "simpler FastCGI". As the web server support for SCGI is limited,
most people use FastCGI instead, which works the same way. Almost everything
that applies to SCGI also applies to FastCGI as well, so we'll only cover
the latter.
These days, FastCGI is never used directly. Just like ``mod_python``, it is only
used for the deployment of WSGI applications.
Setting up FastCGI
^^^^^^^^^^^^^^^^^^
Each web server requires a specific module.
* Apache has both `mod_fastcgi <http://www.fastcgi.com/drupal/>`_ and `mod_fcgid
<https://httpd.apache.org/mod_fcgid/>`_. ``mod_fastcgi`` is the original one, but it
has some licensing issues, which is why it is sometimes considered non-free.
``mod_fcgid`` is a smaller, compatible alternative. One of these modules needs
to be loaded by Apache.
* lighttpd ships its own `FastCGI module
<http://redmine.lighttpd.net/projects/lighttpd/wiki/Docs_ModFastCGI>`_ as well as an
`SCGI module <http://redmine.lighttpd.net/projects/lighttpd/wiki/Docs_ModSCGI>`_.
* `nginx <http://nginx.org/>`_ also supports `FastCGI
<https://www.nginx.com/resources/wiki/start/topics/examples/simplepythonfcgi/>`_.
Once you have installed and configured the module, you can test it with the
following WSGI-application::
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from cgi import escape
import sys, os
from flup.server.fcgi import WSGIServer
def app(environ, start_response):
start_response('200 OK', [('Content-Type', 'text/html')])
yield '<h1>FastCGI Environment</h1>'
yield '<table>'
for k, v in sorted(environ.items()):
yield '<tr><th>%s</th><td>%s</td></tr>' % (escape(k), escape(v))
yield '</table>'
WSGIServer(app).run()
This is a simple WSGI application, but you need to install `flup
<https://pypi.org/project/flup/1.0>`_ first, as flup handles the low level
FastCGI access.
.. seealso::
There is some documentation on `setting up Django with WSGI
<https://docs.djangoproject.com/en/dev/howto/deployment/wsgi/>`_, most of
which can be reused for other WSGI-compliant frameworks and libraries.
Only the ``manage.py`` part has to be changed, the example used here can be
used instead. Django does more or less the exact same thing.
mod_wsgi
--------
`mod_wsgi <http://code.google.com/p/modwsgi/>`_ is an attempt to get rid of the
low level gateways. Given that FastCGI, SCGI, and mod_python are mostly used to
deploy WSGI applications, mod_wsgi was started to directly embed WSGI applications
into the Apache web server. mod_wsgi is specifically designed to host WSGI
applications. It makes the deployment of WSGI applications much easier than
deployment using other low level methods, which need glue code. The downside
is that mod_wsgi is limited to the Apache web server; other servers would need
their own implementations of mod_wsgi.
mod_wsgi supports two modes: embedded mode, in which it integrates with the
Apache process, and daemon mode, which is more FastCGI-like. Unlike FastCGI,
mod_wsgi handles the worker-processes by itself, which makes administration
easier.
.. _WSGI:
Step back: WSGI
===============
WSGI has already been mentioned several times, so it has to be something
important. In fact it really is, and now it is time to explain it.
The *Web Server Gateway Interface*, or WSGI for short, is defined in
:pep:`333` and is currently the best way to do Python web programming. While
it is great for programmers writing frameworks, a normal web developer does not
need to get in direct contact with it. When choosing a framework for web
development it is a good idea to choose one which supports WSGI.
The big benefit of WSGI is the unification of the application programming
interface. When your program is compatible with WSGI -- which at the outer
level means that the framework you are using has support for WSGI -- your
program can be deployed via any web server interface for which there are WSGI
wrappers. You do not need to care about whether the application user uses
mod_python or FastCGI or mod_wsgi -- with WSGI your application will work on
any gateway interface. The Python standard library contains its own WSGI
server, :mod:`wsgiref`, which is a small web server that can be used for
testing.
A really great WSGI feature is middleware. Middleware is a layer around your
program which can add various functionality to it. There is quite a bit of
`middleware <https://wsgi.readthedocs.org/en/latest/libraries.html>`_ already
available. For example, instead of writing your own session management (HTTP
is a stateless protocol, so to associate multiple HTTP requests with a single
user your application must create and manage such state via a session), you can
just download middleware which does that, plug it in, and get on with coding
the unique parts of your application. The same thing with compression -- there
is existing middleware which handles compressing your HTML using gzip to save
on your server's bandwidth. Authentication is another problem that is easily
solved using existing middleware.
Although WSGI may seem complex, the initial phase of learning can be very
rewarding because WSGI and the associated middleware already have solutions to
many problems that might arise while developing web sites.
WSGI Servers
------------
The code that is used to connect to various low level gateways like CGI or
mod_python is called a *WSGI server*. One of these servers is ``flup``, which
supports FastCGI and SCGI, as well as `AJP
<https://en.wikipedia.org/wiki/Apache_JServ_Protocol>`_. Some of these servers
are written in Python, as ``flup`` is, but there also exist others which are
written in C and can be used as drop-in replacements.
There are many servers already available, so a Python web application
can be deployed nearly anywhere. This is one big advantage that Python has
compared with other web technologies.
.. seealso::
A good overview of WSGI-related code can be found in the `WSGI homepage
<https://wsgi.readthedocs.org/>`_, which contains an extensive list of `WSGI servers
<https://wsgi.readthedocs.org/en/latest/servers.html>`_ which can be used by *any* application
supporting WSGI.
You might be interested in some WSGI-supporting modules already contained in
the standard library, namely:
* :mod:`wsgiref` -- some tiny utilities and servers for WSGI
Case study: MoinMoin
--------------------
What does WSGI give the web application developer? Let's take a look at
an application that's been around for a while, which was written in
Python without using WSGI.
One of the most widely used wiki software packages is `MoinMoin
<https://moinmo.in/>`_. It was created in 2000, so it predates WSGI by about
three years. Older versions needed separate code to run on CGI, mod_python,
FastCGI and standalone.
It now includes support for WSGI. Using WSGI, it is possible to deploy
MoinMoin on any WSGI compliant server, with no additional glue code.
Unlike the pre-WSGI versions, this could include WSGI servers that the
authors of MoinMoin know nothing about.
Model-View-Controller
=====================
The term *MVC* is often encountered in statements such as "framework *foo*
supports MVC". MVC is more about the overall organization of code, rather than
any particular API. Many web frameworks use this model to help the developer
bring structure to their program. Bigger web applications can have lots of
code, so it is a good idea to have an effective structure right from the beginning.
That way, even users of other frameworks (or even other languages, since MVC is
not Python-specific) can easily understand the code, given that they are
already familiar with the MVC structure.
MVC stands for three components:
* The *model*. This is the data that will be displayed and modified. In
Python frameworks, this component is often represented by the classes used by
an object-relational mapper.
* The *view*. This component's job is to display the data of the model to the
user. Typically this component is implemented via templates.
* The *controller*. This is the layer between the user and the model. The
controller reacts to user actions (like opening some specific URL), tells
the model to modify the data if necessary, and tells the view code what to
display,
While one might think that MVC is a complex design pattern, in fact it is not.
It is used in Python because it has turned out to be useful for creating clean,
maintainable web sites.
.. note::
While not all Python frameworks explicitly support MVC, it is often trivial
to create a web site which uses the MVC pattern by separating the data logic
(the model) from the user interaction logic (the controller) and the
templates (the view). That's why it is important not to write unnecessary
Python code in the templates -- it works against the MVC model and creates
chaos in the code base, making it harder to understand and modify.
.. seealso::
The English Wikipedia has an article about the `Model-View-Controller pattern
<https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller>`_. It includes a long
list of web frameworks for various programming languages.
Ingredients for Websites
========================
Websites are complex constructs, so tools have been created to help web
developers make their code easier to write and more maintainable. Tools like
these exist for all web frameworks in all languages. Developers are not forced
to use these tools, and often there is no "best" tool. It is worth learning
about the available tools because they can greatly simplify the process of
developing a web site.
.. seealso::
There are far more components than can be presented here. The Python wiki
has a page about these components, called
`Web Components <https://wiki.python.org/moin/WebComponents>`_.
Templates
---------
Mixing of HTML and Python code is made possible by a few libraries. While
convenient at first, it leads to horribly unmaintainable code. That's why
templates exist. Templates are, in the simplest case, just HTML files with
placeholders. The HTML is sent to the user's browser after filling in the
placeholders.
Python already includes two ways to build simple templates::
>>> template = "<html><body><h1>Hello %s!</h1></body></html>"
>>> print template % "Reader"
<html><body><h1>Hello Reader!</h1></body></html>
>>> from string import Template
>>> template = Template("<html><body><h1>Hello ${name}</h1></body></html>")
>>> print template.substitute(dict(name='Dinsdale'))
<html><body><h1>Hello Dinsdale!</h1></body></html>
To generate complex HTML based on non-trivial model data, conditional
and looping constructs like Python's *for* and *if* are generally needed.
*Template engines* support templates of this complexity.
There are a lot of template engines available for Python which can be used with
or without a `framework`_. Some of these define a plain-text programming
language which is easy to learn, partly because it is limited in scope.
Others use XML, and the template output is guaranteed to be always be valid
XML. There are many other variations.
Some `frameworks`_ ship their own template engine or recommend one in
particular. In the absence of a reason to use a different template engine,
using the one provided by or recommended by the framework is a good idea.
Popular template engines include:
* `Mako <http://www.makotemplates.org/>`_
* `Genshi <http://genshi.edgewall.org/>`_
* `Jinja <http://jinja.pocoo.org/>`_
.. seealso::
There are many template engines competing for attention, because it is
pretty easy to create them in Python. The page `Templating
<https://wiki.python.org/moin/Templating>`_ in the wiki lists a big,
ever-growing number of these. The three listed above are considered "second
generation" template engines and are a good place to start.
Data persistence
----------------
*Data persistence*, while sounding very complicated, is just about storing data.
This data might be the text of blog entries, the postings on a bulletin board or
the text of a wiki page. There are, of course, a number of different ways to store
information on a web server.
Often, relational database engines like `MySQL <http://www.mysql.com/>`_ or
`PostgreSQL <http://www.postgresql.org/>`_ are used because of their good
performance when handling very large databases consisting of millions of
entries. There is also a small database engine called `SQLite
<http://www.sqlite.org/>`_, which is bundled with Python in the :mod:`sqlite3`
module, and which uses only one file. It has no other dependencies. For
smaller sites SQLite is just enough.
Relational databases are *queried* using a language called `SQL
<https://en.wikipedia.org/wiki/SQL>`_. Python programmers in general do not
like SQL too much, as they prefer to work with objects. It is possible to save
Python objects into a database using a technology called `ORM
<https://en.wikipedia.org/wiki/Object-relational_mapping>`_ (Object Relational
Mapping). ORM translates all object-oriented access into SQL code under the
hood, so the developer does not need to think about it. Most `frameworks`_ use
ORMs, and it works quite well.
A second possibility is storing data in normal, plain text files (some
times called "flat files"). This is very easy for simple sites,
but can be difficult to get right if the web site is performing many
updates to the stored data.
A third possibility are object oriented databases (also called "object
databases"). These databases store the object data in a form that closely
parallels the way the objects are structured in memory during program
execution. (By contrast, ORMs store the object data as rows of data in tables
and relations between those rows.) Storing the objects directly has the
advantage that nearly all objects can be saved in a straightforward way, unlike
in relational databases where some objects are very hard to represent.
`Frameworks`_ often give hints on which data storage method to choose. It is
usually a good idea to stick to the data store recommended by the framework
unless the application has special requirements better satisfied by an
alternate storage mechanism.
.. seealso::
* `Persistence Tools <https://wiki.python.org/moin/PersistenceTools>`_ lists
possibilities on how to save data in the file system. Some of these
modules are part of the standard library
* `Database Programming <https://wiki.python.org/moin/DatabaseProgramming>`_
helps with choosing a method for saving data
* `SQLAlchemy <http://www.sqlalchemy.org/>`_, the most powerful OR-Mapper
for Python, and `Elixir <https://pypi.org/project/Elixir>`_, which makes
SQLAlchemy easier to use
* `SQLObject <http://www.sqlobject.org/>`_, another popular OR-Mapper
* `ZODB <https://launchpad.net/zodb>`_ and `Durus
<https://www.mems-exchange.org/software/>`_, two object oriented
databases
.. _framework:
Frameworks
==========
The process of creating code to run web sites involves writing code to provide
various services. The code to provide a particular service often works the
same way regardless of the complexity or purpose of the web site in question.
Abstracting these common solutions into reusable code produces what are called
"frameworks" for web development. Perhaps the most well-known framework for
web development is Ruby on Rails, but Python has its own frameworks. Some of
these were partly inspired by Rails, or borrowed ideas from Rails, but many
existed a long time before Rails.
Originally Python web frameworks tended to incorporate all of the services
needed to develop web sites as a giant, integrated set of tools. No two web
frameworks were interoperable: a program developed for one could not be
deployed on a different one without considerable re-engineering work. This led
to the development of "minimalist" web frameworks that provided just the tools
to communicate between the Python code and the http protocol, with all other
services to be added on top via separate components. Some ad hoc standards
were developed that allowed for limited interoperability between frameworks,
such as a standard that allowed different template engines to be used
interchangeably.
Since the advent of WSGI, the Python web framework world has been evolving
toward interoperability based on the WSGI standard. Now many web frameworks,
whether "full stack" (providing all the tools one needs to deploy the most
complex web sites) or minimalist, or anything in between, are built from
collections of reusable components that can be used with more than one
framework.
The majority of users will probably want to select a "full stack" framework
that has an active community. These frameworks tend to be well documented,
and provide the easiest path to producing a fully functional web site in
minimal time.
Some notable frameworks
-----------------------
There are an incredible number of frameworks, so they cannot all be covered
here. Instead we will briefly touch on some of the most popular.
Django
^^^^^^
`Django <https://www.djangoproject.com/>`_ is a framework consisting of several
tightly coupled elements which were written from scratch and work together very
well. It includes an ORM which is quite powerful while being simple to use,
and has a great online administration interface which makes it possible to edit
the data in the database with a browser. The template engine is text-based and
is designed to be usable for page designers who cannot write Python. It
supports template inheritance and filters (which work like Unix pipes). Django
has many handy features bundled, such as creation of RSS feeds or generic views,
which make it possible to create web sites almost without writing any Python code.
It has a big, international community, the members of which have created many
web sites. There are also a lot of add-on projects which extend Django's normal
functionality. This is partly due to Django's well written `online
documentation <https://docs.djangoproject.com/>`_ and the `Django book
<http://www.djangobook.com/>`_.
.. note::
Although Django is an MVC-style framework, it names the elements
differently, which is described in the `Django FAQ
<https://docs.djangoproject.com/en/dev/faq/general/#django-appears-to-be-a-mvc-framework-but-you-call-the-controller-the-view-and-the-view-the-template-how-come-you-don-t-use-the-standard-names>`_.
TurboGears
^^^^^^^^^^
Another popular web framework for Python is `TurboGears
<http://www.turbogears.org/>`_. TurboGears takes the approach of using already
existing components and combining them with glue code to create a seamless
experience. TurboGears gives the user flexibility in choosing components. For
example the ORM and template engine can be changed to use packages different
from those used by default.
The documentation can be found in the `TurboGears documentation
<https://turbogears.readthedocs.org/>`_, where links to screencasts can be found.
TurboGears has also an active user community which can respond to most related
questions. There is also a `TurboGears book <http://turbogears.org/1.0/docs/TGBooks.html>`_
published, which is a good starting point.
The newest version of TurboGears, version 2.0, moves even further in direction
of WSGI support and a component-based architecture. TurboGears 2 is based on
the WSGI stack of another popular component-based web framework, `Pylons
<http://www.pylonsproject.org/>`_.
Zope
^^^^
The Zope framework is one of the "old original" frameworks. Its current
incarnation in Zope2 is a tightly integrated full-stack framework. One of its
most interesting feature is its tight integration with a powerful object
database called the `ZODB <https://launchpad.net/zodb>`_ (Zope Object Database).
Because of its highly integrated nature, Zope wound up in a somewhat isolated
ecosystem: code written for Zope wasn't very usable outside of Zope, and
vice-versa. To solve this problem the Zope 3 effort was started. Zope 3
re-engineers Zope as a set of more cleanly isolated components. This effort
was started before the advent of the WSGI standard, but there is WSGI support
for Zope 3 from the `Repoze <http://repoze.org/>`_ project. Zope components
have many years of production use behind them, and the Zope 3 project gives
access to these components to the wider Python community. There is even a
separate framework based on the Zope components: `Grok
<http://grok.zope.org/>`_.
Zope is also the infrastructure used by the `Plone <https://plone.org/>`_ content
management system, one of the most powerful and popular content management
systems available.
Other notable frameworks
^^^^^^^^^^^^^^^^^^^^^^^^
Of course these are not the only frameworks that are available. There are
many other frameworks worth mentioning.
Another framework that's already been mentioned is `Pylons`_. Pylons is much
like TurboGears, but with an even stronger emphasis on flexibility, which comes
at the cost of being more difficult to use. Nearly every component can be
exchanged, which makes it necessary to use the documentation of every single
component, of which there are many. Pylons builds upon `Paste
<http://pythonpaste.org/>`_, an extensive set of tools which are handy for WSGI.
And that's still not everything. The most up-to-date information can always be
found in the Python wiki.
.. seealso::
The Python wiki contains an extensive list of `web frameworks
<https://wiki.python.org/moin/WebFrameworks>`_.
Most frameworks also have their own mailing lists and IRC channels, look out
for these on the projects' web sites.