Differences between revisions 2 and 12 (spanning 10 versions)
Revision 2 as of 2006-05-08 13:29:31
Size: 12318
Editor: outgw
Revision 12 as of 2006-05-12 19:58:09
Size: 24674
Editor: tcn-orr
Line 3: Line 3:
This page describes parts of the Path class design which are in discussion. It is meant to show the current state of the discussion, so when we reach a consensus, we can delete all the discussion details and just write the decision.

Please write here your opinions. I (Noam) am terribly sorry, but due to lack of time (I don't sleep enough already) I only wrote my opinions. Please write yours, or write that you agree, so that we'll know if we agree on something or have to discuss it more.

The page is divided into sections, to make it easy to see what is said about what. Please open new sections if you have new subjects to discuss.
This page describes parts of the Path class design which are in discussion. It is meant to show the current state of the discussion, so when we reach a consensus, we can delete all the discussion details and just write the decision. Please write your opinions below in the appropriate section, or start a new section. Also indicate what you agree with, so we know how close to consensus we are.

This discussion will be used to write a PEP (an alternative to PEP 355) and reference implementation.
Line 11: Line 9:
'''agreed:''' A logical representation is better than a string representation. '''agreed:''' A logical representation is better than a string representation. {{{p[:]}}} should behave like a tuple of path components (directories and the final filename). {{{p[n1:n2]}}} should return a new Path containing only the sliced components. {{{p1 + p2}}} should join paths (smartly if p2 is absolute). This eliminates the need for several properties/methods: .parent, .name, .join(), .split(), etc. {{{str(p)}}} should return a platform-specific string representing the entire path.

= Path + string =

What should {{{Path("/a/b") + "c"}}} do? Alternatives:

 1. Join paths. Same as {{{Path("/a/b/c")}}}.
 2. Append to the filename (useful for extensions). Same as {{{Path("/a/bc")}}}.
 3. Raise an exception because {{{tuple + string}}} and {{{list + string}}} are illegal in Python.

Noam: I know that tuple + string are illegal, but I think that since there's an obvious way to treat the string as a path, it's ok.

Mike: Maybe. But Guido rejected {{{/}}} for joining; he may also reject {{{+}}}. Its obviousness is debatable. If we do use {{{+}}} for joining, we'll need APIs to modify the filename and extension without having to split/rejoin. ''Note: discussion of a filename/extension API is in the Filename/Extensions section below.''
Line 15: Line 25:
Noam: A sequence. As Mike has said, a sequence allows for slicing to work simply. I think that's the main reason to use a sequence. Besides, you don't have to remember several attributes which save the data of the path: it's all in the sequence. About a different attribute for extensions: I don't like it. I think that extensions shouldn't make the logical representation more complex. See the section about extensions for a proposed solution. '''agreed:''' The filename or leaf directory should be the final component of the sequence, with extensions treated as part of the filename.

Should the root and drive be encoded as the first component of the sequence, or as attributes? On POSIX there is one root: "/". On Windows, each drive has its own root and current directory: r"C:\", r"D:\". There is an implied default drive, subject to .chdir. r"C:foo" is relative to drive C:'s current directory. Should we encode all this info in the first Path component, or as .isabs/.isabsolute and .root attributes? What about Windows UNC paths (r"\\a\b\c")?

Noam: Keep it all in the sequence. Sequence slicing is simple and intuitive. Attributes storing data not found in the sequence is complicating matters.

Mike: Encoding absolute/relative and drive in the sequence may be too obscure and magical.

Mike: Note that slicing off the front of an absolute path makes a relative path. {{{Path("/a/b/c")[1:] => Path("b/c")}}}
Line 19: Line 37:
Noam: I don't like it. Sometimes I don't know whether a path is a file or directory - for example, "svn add FILE" adds a file if it's a file and recursively adds all the files in the directory if it's a directory. It does so by examining FILE to see whether it's a file or a directory. I think that a path is representation of "how to get to somewhere on the filesystem", and it can result in a file, a directory, a symbolic link, or simply not exist.

Jason: I don't like it. It seems like I've worked with APIs like this and it's a pain. It doesn't let you remain uncertain about it. What do you do with isfile() and isdir() in this sort of design?

== Inheritence from str to allow easy use in other functions ==

Noam: I think that it doesn't work: Slicing by path element works differently from slicing by character, so inheriting from str breaks the rule that a subclass should behave like the base class.

== Inheritence from tuple ==
'''agreed:''' The same class will represent a file, directory, or symbolic link. (Reasons can be found in the wiki history)

== Inheritance from str to allow easy use in other functions ==

Noam, Mike: This won't work. Strings must slice by character, and this is incompatible with slicing by directory component.

== Inheritance from tuple ==
Line 33: Line 49:
== A different class from special treatment of symbolic links ==

(Nick proposed that)

Noam: I think it complicates matters, and I don't see what's the benefit. On the contrary: I think that specifying the kind of method to use is clearer than stating it once and forgetting about it.

Jason: Agree with Noam.
Noam: (My previous statement wasn't well formed) I guess this may be left to Guido's decision. I feel that subclassing from tuple is fine, but I don't really care.

Mike: the top level can emulate tuple slicing/addition to return a new Path object. It doesn't have to *subclass* tuple.

Noam: Can you please elaborate about why not to subclass from tuple?

Mike: Containment is better than inheritance. Never subclass if you can reasonably put the value in an attribute; it leads to all sorts of potential conflicts and bugs. Subclass only if the object really ''is'' a type of the superclass, and/or if the user will be calling a lot of the superclass's methods directly.

Noam: You can always use containment - you never really need to subclass. I think that if it's agreed that all the data is stored in the sequence, inheritance from tuple is ok, since we really behave like an immutable sequence, and add some operations about the sequence.
Line 43: Line 61:

I don't like that. I think that you should have a subclass for each platform, which is responsible for parsing a string and for formatting a string. For example, I saw in the macpath class that relative paths on the old mac start with ':'. I don't think that a root element can handle that.

I think that it makes much sense to have a different subclass for each platform: There are other things which are different for different platforms (some methods only available on one platform and not the other). A URL will also be another subclass, with its own appropriate methods.
'''agreed:''' The separator should be an attribute of the path class, not of the root element. (reasons are in version 10 of the wiki)

Mike: I don't think anybody proposed storing the separator in the root element; it was a misunderstanding. So this section can probably be deleted.
Line 51: Line 67:
Noam: I think that immutable paths are somewhat easier to implement, and allow usage as dictionary keys. I think that if we have managed to live so far without mutable strings, we will manage to live without mutable paths. I don't see this as a major issue, but immutable paths can be somewhat more efficient: you can hash the string representation, and you can make sure you have a path by writing things like {{{ dst=path(dst) }}}, and if dst is already a path, no new object will be created.

Jason: Agree with Noam.
Noam, Jason, Mike: I think that immutable paths are somewhat easier to implement, and allow usage as dictionary keys. I think that if we have managed to live so far without mutable strings, we will manage to live without mutable paths. I don't see this as a major issue, but immutable paths can be somewhat more efficient: you can hash the string representation, and you can make sure you have a path by writing things like {{{ dst=path(dst) }}}, and if dst is already a path, no new object will be created.

= In which module(s)? =

Mike: A new 'basepath' module would contain the common base class. The platform modules (posixpath, ntpath, etc) already exist and are the logical place for these Path classes.

Noam: I think that all path OS subclasses fit nicely into one module. Most of the logic is in the base class, anyway, and it makes it easier to see what are the differences between each platform.

Mike: Putting code for disparate architectures in one module is asking for trouble. What if one architecture needs to import modules which aren't needed or can't be built on other architectures, especially C modules? Plus the module would become very large due to the need to accommodate Windows's intricacies (e.g., r"C:foo", r"\\uncpath").

= Filenames =

Mike: If we use {{{+}}} for path joining, we need a way to create a derived path from a modified filename. Example: "I want to add a prefix or suffix to the filename portion of /a/b/filename." Splitting/rejoining the path is messy, especially if you have to modify the base name but preserve the extensions. No specific API proposal yet.
Line 57: Line 83:
'''agreed:''' extensions are a common and platform-specific convention, so treating them should be made easy by the class.


I think that the basic representation should ignore extension conventions, as it doesn't matter for the path - the walk from one node to another. How about using string subclasses instead of normal strings for elements, that would behave exactly like normal strings but would allow some extension operations? For example, you would be able to write things like {{{ p[-1].ext }}}.

The interface should be defined: How should we distinguish between a file with an empty extension ("a.") and a file without an extension ("a")? And what should be the methods, anyway?
'''agreed:''' extensions are critical, so the class must make it easy to query/modify them without splitting/rejoining the Path. Like directories, extensions have a platform-specific separator. Unlike directories, extensions are conventions rather than OS-enforced rules: not every apparent extension should be treated as such. The user must tell us when to recognize extensions, defined as N number of filename suffixes beginning with the platform's extension separator (.extsep). For instance, most users consider "filename.2005-05-13.tar.gz" and "filename.2006.05.13.tar.gz" as having two extensions each (".tar.gz"), even though the number of apparent extensions is larger.

We can put attributes/methods on the Path object, or on a special str/unicode subclass used for the filename (or for each directory component).

Noam: How should we distinguish between a file with an empty extension ("a.") and a file without an extension ("a")?

Mike: The legacy os.path.splitext() returns ".ext", so it presumably returns "." for an empty extension. We could stick with this. That prevents the ability of treating extensions platform-independently though. I doubt "a." is important enough to support though, have you ever seen it?

Mike: subclassing str is impractical due to the string/unicode duality. Why not path properties: p.ext, p.name (name without extension). The full filename is p[-1] so it doesn't need a property.

Noam: Why does the string/unicode duality make subclassing str impractical? On Windows we can have unicode subclasses, and on POSIX we can have str subclasses. Having extension-related methods added to elements is nice because:
 * Extension is an attribute of a path element, not of the sequence of path elements. (dirs can have extensions just as well)
 * It reduces the number of methods of the path type and makes it easier to distinguish between different kinds of methods.

What should be the interface? Mike said that adding and removing extensions is important. How should it be done?

Mike: There must be a convenient way to add/delete extensions. How about {{{p.add_ext(*exts_without_separator)}}} and {{{p.del_ext(n=1)}}}, each returning new Paths. The only other operations then are querying N extensions or splitting the filename into name + N extensions. (Note: if the extsep is attached, an empty string in the result would mean "there are not that many extensions".)
Line 67: Line 103:
Mike (quoted from an email):

Not sure about this. I see the point in not duplicating .foo() vs
.stat().foo. .foo() exists in os.path to avoid the ugliness of
os.stat() in the middle of an expression. I think the current
recommendation is to just do stats all the time because the overhead
is minimal and it's not worth getting out of sync.

The question is, does forcing people to use .stat() expose an
implementation detail that should be hidden, and does it smell of
Unixism? Most people think a file *is* a regular file or a directory.
 The fact that this is encoded in the file's permission bits -- which
stat() examines -- is a quirk of Unix.


I think that calling stat once is a reasonable thing. Where I work we have a really slow network, and you feel every filesystem call. I also think that calling stat repeatedly may cause synchronization bugs: the stat may change while the logic already assumes something about it.

I don't see stat as a unixism - what's wrong about getting information about a file?
'''agreed:''' p.stat() and p.lstat() should return an enhanced version of Python's os.stat() object, with attributes like {{{p.stat().mtime}}} for all information traditionally provided by stat. Include Noam's additional properties from http://wiki.python.org/moin/AlternativePathModule. Do not have Path methods duplicating stat attributes.

Mike: Unlike os.stat(), do not support ugly attributes like .st_mtime or tuple indexing.

Mike: I formerly proposed moving all stat attributes into Path methods, because the distinction between "stat attribute" and "other file info" was arbitrarily defined by Unix tradition, but withdrew this because it's not critical. Having .stat() does let the user cache the result, and having .lstat() avoids the need for a parallel set of methods that don't follow symlinks.
Line 89: Line 111:
Swiss army methods are even more evil than wide APIs. And I consider the term 'glob' itself to be a Unixism - I've found the technique to be far more commonly known as wildcard matching in the Windows world.


Can you give examples why this proposed method is evil? I think that the basic pattern idea is well defined. It gets three arguments. topdown is, I think, well defined and may be useful. onlyfiles and onlydirs are well defined and are only a convenience. I don't really mind omitting them.
Jason Orendorff's path module has three methods returning a non-recursive list of paths: listdir, files, dirs; and three methods returning a recursive iteration of paths: walk, walkfiles, walkdirs. Noam proposed combining all these plus filename globbing into one method: glob, with a special pattern "**" meaning "any subdirectory or recursive path of subdirectories".

Nick: Swiss army methods are even more evil than wide APIs. And I consider the term 'glob' itself to be a Unixism - I've found the technique to be far more commonly known as wildcard matching in the Windows world.

Noam: Can you give examples why this proposed method is evil? I think that the basic pattern idea is well defined. It gets three arguments. topdown is, I think, well defined and may be useful. onlyfiles and onlydirs are well defined and are only a convenience. I don't really mind omitting them.
Line 101: Line 122:


I see what you mean. How about "glob" doing what it does in the current proposal, without the "onlyfiles" and "onlydirs" arguments, and "files" and "dirs" getting exactly the same arguments but yielding only files and directories, respectively?

About the "l" versions: Having glob, files, dirs, lglob, lfiles, ldirs seems ugly. Perhaps this ''should'' go in as a flag, say, "follow_symlinks=True"? (I would put it after pattern, because remembering the string "topdown" is easier. I don't think of any better name than "follow_symlinks". I also tend to think that it is more useful.)

Mike: Non-recursive lists: listdir, files, dirs, symlinks. Recursive iterators: walk, walkfiles, walkdirs, walklinks. All except *links should take a 'symlinks' argument, default True, meaning follow symlinks. If false, never return a symlink. The user can call *links to get the symlinks separately if desired. listdir should have a 'names_only' argument, default False, meaning return the same as os.listdir(). Doesn't a 'pattern' argument eliminate the need for .glob()?

Noam: Can you explain why you think that "listdir, files, dirs, walk, walkfiles, walkdirs" is better than "glob, files, dirs"? I prefer three over six. About "links" methods - Do you have examples of when they are useful? Thinking about it, it seems that dirs+files should cover all the files in the directory, when symlinks are considered directories if they point to directories in the follow_symlinks mode. About the names_only: I don't like an attribute which changes the type of the result. You can always do x[-1] to get the base name.

Mike: Combining the recursive and non-recursive methods is acceptable. They would all have to be generators in that case. .glob() is not the best name: it sounds like something else to Unix people and incomprehensible to non-Unix people. The *links() methods are useful when you want to treat symlinks specially; they eliminate an if-stanza in the main for loop. No reason to shove disparate things into the same loop. If symlinks=True, we do follow the links and inspect the actual directory/file, so we're in agreement. We can drop names_only if we add listdir(). Sometimes you just want the names, and it's a pain (and inefficient) to unpack temporary Path objects made from those same names.

About glob: Can you suggest a better name? I'm happy with glob but have nothing against a better name.

About listdir: I prefer to omit that method. From my experience, you always want to add the base name to the dir name (what would you do with it otherwise?) I can live with the slight inefficiency and small pain of making a path and taking only the last element on the rare cases in which it's needed. I prefer the "one way to do it" approach here.

About symlinks: I see what you mean. I prefer one iteration with an if stanza, since I then iterate over the contents of a directory only once, but it seems like a reasonable friend of "dirs" and "files". The name "link" is ok, but we should make sure that all symlinks are referred to as "links" in the method names - I don't want to remember when it's a link and when it's a symlink. If so, the "link" method should be renamed "hardlink".

But "lfiles", "ldirs" are so ugly...

Mike: {{{p.listdir() => os.listdir(str(p))}}} is small, simple, and unobtrusive; it won't bother anybody except purists. Say you need the filenames for a GUI list box or a menu. It's hard to find a name that means ".dirs plus .files"; maybe {{{.walk(recursive=False)}}} is OK. That will surprise existing users of walk functions, but we haven't found a better name. I agree we should be consistent about .(sym)links methods; maybe we should rename .link to .hardlink because it's so rarely used. 'follow_symlinks' as an argument is also acceptable; it's wordy but perhaps better self-explanatory than 'symlinks'. Down with the 'l' versions!
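
To make the converging proposal easier to picture, here is a minimal sketch of generator-based `files` and `dirs` helpers sharing one `pattern` and one `follow_symlinks` argument. It works on plain strings with `os` and `fnmatch` rather than on a Path class, and the names `entries`, `files`, `dirs` and the `recursive` flag are assumptions made for illustration, not an agreed API.

```python
import fnmatch
import os


def entries(directory, pattern="*", follow_symlinks=True, recursive=False):
    """Yield full paths below `directory` whose base name matches `pattern`."""
    for name in sorted(os.listdir(directory)):
        full = os.path.join(directory, name)
        if not follow_symlinks and os.path.islink(full):
            continue  # per the discussion: if false, never return a symlink
        if fnmatch.fnmatch(name, pattern):
            yield full
        # Descend even when the directory name itself does not match the pattern.
        if recursive and os.path.isdir(full):
            yield from entries(full, pattern, follow_symlinks, recursive)


def files(directory, **options):
    """Like entries(), but yield only regular files."""
    return (p for p in entries(directory, **options) if os.path.isfile(p))


def dirs(directory, **options):
    """Like entries(), but yield only directories."""
    return (p for p in entries(directory, **options) if os.path.isdir(p))
```

For example, `list(files(".", pattern="*.txt", recursive=True))` would collect every matching file below the current directory. The sketch does not guard against symlink cycles when `follow_symlinks` and `recursive` are both true; a real implementation would need to.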
Line 162: Line 206:
Mike: .pairedwalk() and friends may be useful. The user wants to know which files/directories to create, update, and delete. So it's essentially a diff report.

Noam: I'm not sure about pairedwalk() - it may be a bit complicated, I'm afraid. However, perhaps copytree() isn't such a big deal if it works only when the source is a directory and the destination doesn't exist. Then, exceptions aren't expected, so if they happen they can simply be propagated.
Line 180: Line 228:
Noam: What about copyto? It's easier to write, I think that it's not hard to understand, and perhaps it focuses less attention on the "to", making it look like a special kind of copy.

Mike: src.copy(dest, content=True, mode=False, time=False). .copy_to is OK, .copyto is bad. Almost everybody expects .copy to mean .copy_to and not .copy_from.

Noam: Please, what does the content/copyfile argument mean? About copymode vs. mode: I prefer copymode. "mode" seems like a mode specification (like in mkdir), not like a boolean.

Mike: 'content' means copy the file contents. If content=False and the destination doesn't exist, create an empty file. If content=False and the destination does exist, copy the file attributes only but don't modify the content. This covers all copying use cases.

Noam: Why should you want to copy only attributes? Can you give an example?

Mike: When you want to make the perms/mtime of one file match another file. When you want to create an empty log file with the same perms/mtime as another file, because the logging program will have modify permission but not create permission.
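
The semantics Mike describes for the `content`, `mode` and `time` flags can be sketched with the standard `shutil` and `os` modules. This is only an illustration on plain string paths; the free function `copy` stands in for the proposed `src.copy(dest, ...)` method, and the exact behaviour is an assumption drawn from the description above.

```python
import os
import shutil


def copy(src, dest, content=True, mode=False, time=False):
    """Copy file contents and/or attributes from src to dest."""
    if content:
        shutil.copyfile(src, dest)
    elif not os.path.exists(dest):
        # content=False and no destination yet: create an empty file
        open(dest, "w").close()
    # content=False and destination exists: leave its content untouched
    if mode:
        shutil.copymode(src, dest)
    if time:
        info = os.stat(src)
        os.utime(dest, (info.st_atime, info.st_mtime))
```

The empty-log-file case from the discussion would then read `copy("old.log", "new.log", content=False, mode=True, time=True)`.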
Line 191: Line 251:


Thanks for the explanation. I agree about not mixing different kinds of strings. Is there a good way to convert unicode strings into file names on POSIX? How do you know the right encoding?

Mike: At first I thought about forcing everything to Unicode on input and adding 'encoding' and 'onerror' arguments to the constructor. That doesn't solve the problem of choosing the charset to encode on output. But now I'm wondering if we should just preserve whatever type(s) the user inputs.

Noam: I don't think that preserving the type of the user input will work: You'll still have to decode it to str on POSIX. It seems to me that the only solution is to use the native "alphabet" of the system: Unicode chars on Windows, and byte chars on POSIX. To put it more clearly: All elements on Windows will be unicode, all elements on POSIX will be str.
Line 205: Line 273:

Mike: We cannot deprecate the existing functions in Python 2.x; too many existing programs would break. But we can discourage them in the documentation.

= Additional methods/attributes =

== .purge() ==

Mike: Delete "it" recursively if it exists, whatever it is. This is convenient when you don't care whether it's a file or directory, you just want to overwrite it, and you don't want to take six lines of code to do it.

Noam: Why six lines of code? I count four:

if p.isfile():
    p.remove()
elif p.isdir():
    p.rmtree()

We can have rmtree work also for files, and even for non-existing paths, but I'm not sure it's a good idea.

Mike: .rmtree would go away if .purge is added. So we'd have to inline its implementation. The main reason for .purge is .rmtree raises exceptions if (A) the Path is a file, or (B) the Path doesn't exist, and you don't want to clutter your code for all those cases when you just want to write or remove "it".

Noam: I feel fine with the four lines above, but I can live with another method. We can bring this to python-dev decision.

Mike: Adding the two capabilities to .rmtree would be functionally the same. I think .purge is a better name though.
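
A rough sketch of what `.purge()` could do, written as a free function over a string path with `os` and `shutil`. Treating a symlink as a file (removing the link itself, not its target) is an assumption of this sketch; the discussion above does not settle it.

```python
import os
import shutil


def purge(path):
    """Delete `path` whatever it is; do nothing if it does not exist."""
    if os.path.islink(path) or os.path.isfile(path):
        os.remove(path)  # files, and links themselves (assumed behaviour)
    elif os.path.isdir(path):
        shutil.rmtree(path)  # directories, recursively
    # Anything else is taken to be a missing path: silently succeed.
```

With this, "overwrite whatever is there" becomes `purge(target)` followed by the write, with no branching in the caller.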

= mkdir/rmdir =

Mike: These should succeed silently if the operation is already done. Otherwise the user has to write an unnecessary "if p.exists():" around it. If the user really cares whether the item exists, he can explicitly write the if-stanza. If not, he shouldn't be forced to clutter his code, especially since that obscures whether it does matter or not that the item existed.
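
As an illustration of "succeed silently if the operation is already done", here is a small sketch on top of `os.mkdir` and `os.rmdir`. The function names mirror the proposed methods; re-raising when a non-directory occupies the path is an assumption of the sketch.

```python
import os


def mkdir(path):
    """Create a directory; an already existing directory is not an error."""
    try:
        os.mkdir(path)
    except FileExistsError:
        if not os.path.isdir(path):
            raise  # something else is in the way, which the caller should hear about


def rmdir(path):
    """Remove an empty directory; a missing directory is not an error."""
    try:
        os.rmdir(path)
    except FileNotFoundError:
        pass
```

Calling either function twice in a row has the same effect as calling it once, which is the behaviour asked for above.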

This page describes parts of the Path class design which are in discussion. It is meant to show the current state of the discussion, so when we reach a consensus, we can delete all the discussion details and just write the decision. Please write your opinions below in the appropriate section, or start a new section. Also indicate what you agree with, so we know how close to consensus we are.

This discussion will be used to write a PEP (an alternative to PEP 355) and reference implementation.


agreed: A logical representation is better than a string representation. p[:] should behave like a tuple of path components (directories and the final filename). p[n1:n2] should return a new Path containing only the sliced components. p1 + p2 should join paths (smartly if p2 is absolute). This eliminates the need for several properties/methods: .parent, .name, .join(), .split(), etc. str(p) should return a platform-specific string representing the entire path.
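
The agreed behaviour can be sketched in a few lines of Python. This is a POSIX-only illustration, not the reference implementation: it assumes the root is kept as an `absolute` attribute rather than as the first component (a point still under discussion below), and it reads "smartly if p2 is absolute" as "the right-hand path wins", the way os.path.join behaves.

```python
class Path:
    """Hypothetical POSIX-only sketch; components live in a contained tuple."""

    sep = "/"

    def __init__(self, value=(), absolute=False):
        if isinstance(value, str):
            absolute = value.startswith(self.sep)
            value = [part for part in value.split(self.sep) if part]
        self._parts = tuple(value)
        self.absolute = absolute  # assumption: root is an attribute, not a component

    def __len__(self):
        return len(self._parts)

    def __getitem__(self, index):
        if isinstance(index, slice):
            start, _stop, _step = index.indices(len(self._parts))
            # Dropping the first component detaches the result from the root.
            keeps_root = self.absolute and start == 0
            return Path(self._parts[index], absolute=keeps_root)
        return self._parts[index]  # a single component is a plain string

    def __add__(self, other):
        if not isinstance(other, Path):
            return NotImplemented  # Path + string is a separate, open question
        if other.absolute:
            return other  # assumed meaning of joining "smartly"
        return Path(self._parts + other._parts, absolute=self.absolute)

    def __eq__(self, other):
        return (isinstance(other, Path)
                and (self.absolute, self._parts) == (other.absolute, other._parts))

    def __hash__(self):
        return hash((self.absolute, self._parts))

    def __str__(self):
        body = self.sep.join(self._parts)
        return self.sep + body if self.absolute else body
```

With this sketch, `Path("/a/b/c")[:2]` plays the role of .parent, `Path("/a/b/c")[-1]` the role of .name, and `Path("a") + Path("b/c")` the role of .join().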

Path + string

What should Path("/a/b") + "c" do? Alternatives:

  1. Join paths. Same as Path("/a/b/c").

  2. Append to the filename (useful for extensions). Same as Path("/a/bc").

  3. Raise an exception because tuple + string and list + string are illegal in Python.

Noam: I know that tuple + string are illegal, but I think that since there's an obvious way to treat the string as a path, it's ok.

Mike: Maybe. But Guido rejected / for joining; he may also reject +. Its obviousness is debatable. If we do use + for joining, we'll need APIs to modify the filename and extension without having to split/rejoin. Note: discussion of a filename/extension API is in the Filename/Extensions section below.
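
To compare the three alternatives side by side, the sketch below applies each one to a bare tuple of components. The function names are made up for this illustration, and "/" is assumed as the separator.

```python
def join_components(parts, text):
    """Alternative 1: treat the string as a path and join it."""
    return parts + tuple(piece for piece in text.split("/") if piece)


def append_to_filename(parts, text):
    """Alternative 2: glue the string onto the final component."""
    return parts[:-1] + (parts[-1] + text,)


def reject(parts, text):
    """Alternative 3: behave like tuple + str, which raises TypeError."""
    return parts + text
```

Given the components `("a", "b")` and the string `"c"`, the first returns `("a", "b", "c")`, the second returns `("a", "bc")`, and the third raises TypeError.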

One sequence or several parts?

agreed: The filename or leaf directory should be the final component of the sequence, with extensions treated as part of the filename.

Should the root and drive be encoded as the first component of the sequence, or as attributes? On POSIX there is one root: "/". On Windows, each drive has its own root and current directory: r"C:\", r"D:\". There is an implied default drive, subject to .chdir. r"C:foo" is relative to drive C:'s current directory. Should we encode all this info in the first Path component, or as .isabs/.isabsolute and .root attributes? What about Windows UNC paths (r"\\a\b\c")?

Noam: Keep it all in the sequence. Sequence slicing is simple and intuitive. Attributes storing data not found in the sequence is complicating matters.

Mike: Encoding absolute/relative and drive in the sequence may be too obscure and magical.

Mike: Note that slicing off the front of an absolute path makes a relative path. Path("/a/b/c")[1:] => Path("b/c")

A separate class for files and directories?

agreed: The same class will represent a file, directory, or symbolic link. (Reasons can be found in the wiki history)

Inheritance from str to allow easy use in other functions

Noam, Mike: This won't work. Strings must slice by character, and this is incompatible with slicing by directory component.

Inheritance from tuple

Noam: I think it works well. Guido said that he didn't like it, but I don't understand why. If all the data is stored in the sequence, I think a sequence interface should be provided. As far as I can see, the tuple interface is just that: an interface for an immutable sequence. This means that it doesn't cause any unwanted restrictions, so I don't see why not to inherit from it.

Jason: I suggest making it look like a sequence without actually subclassing tuple. It is rather strange to be subclassing tuple this way.

Noam: (My previous statement wasn't well formed) I guess this may be left to Guido's decision. I feel that subclassing from tuple is fine, but I don't really care.

Mike: the top level can emulate tuple slicing/addition to return a new Path object. It doesn't have to *subclass* tuple.

Noam: Can you please elaborate about why not to subclass from tuple?

Mike: Containment is better than inheritance. Never subclass if you can reasonably put the value in an attribute; it leads to all sorts of potential conflicts and bugs. Subclass only if the object really is a type of the superclass, and/or if the user will be calling a lot of the superclass's methods directly.

Noam: You can always use containment - you never really need to subclass. I think that if it's agreed that all the data is stored in the sequence, inheritance from tuple is ok, since we really behave like an immutable sequence, and add some operations about the sequence.
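
One concrete point relevant to either side of this argument: a bare tuple subclass does not get Path-returning slicing or addition for free. The hypothetical `TuplePath` below overrides nothing, and both operations fall back to plain tuples, so a tuple-derived Path would still have to override `__getitem__` and `__add__`, just as a containing class has to define them.

```python
class TuplePath(tuple):
    """Hypothetical Path that inherits everything from tuple unchanged."""


p = TuplePath(("a", "b", "c"))
sliced = p[:2]
joined = p + TuplePath(("d",))

# Both results have silently fallen back to the base class.
print(type(sliced).__name__, type(joined).__name__)
```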

Root element storing the separator

agreed: The separator should be an attribute of the path class, not of the root element. (reasons are in version 10 of the wiki)

Mike: I don't think anybody proposed storing the separator in the root element; it was a misunderstanding. So this section can probably be deleted.


Noam, Jason, Mike: I think that immutable paths are somewhat easier to implement, and allow usage as dictionary keys. I think that if we have managed to live so far without mutable strings, we will manage to live without mutable paths. I don't see this as a major issue, but immutable paths can be somewhat more efficient: you can hash the string representation, and you can make sure you have a path by writing things like  dst=path(dst) , and if dst is already a path, no new object will be created.
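
The two efficiency points (hashing the string representation, and `path(dst)` returning `dst` itself when it is already a path) can be sketched as follows. The class is a stripped-down illustration that stores only the string; the attribute name `_text` is an assumption.

```python
class Path:
    """Hypothetical immutable path holding only its string form."""

    __slots__ = ("_text",)

    def __new__(cls, value):
        if isinstance(value, cls):
            return value  # already a Path: no new object is created
        self = super().__new__(cls)
        object.__setattr__(self, "_text", str(value))
        return self

    def __setattr__(self, name, value):
        raise AttributeError("Path objects are immutable")

    def __eq__(self, other):
        return isinstance(other, Path) and self._text == other._text

    def __hash__(self):
        return hash(self._text)  # hash the string representation

    def __str__(self):
        return self._text
```

A function can then start with `dst = Path(dst)` and accept either a string or a Path at no extra cost, and Path objects can be used as dictionary keys.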

In which module(s)?

Mike: A new 'basepath' module would contain the common base class. The platform modules (posixpath, ntpath, etc) already exist and are the logical place for these Path classes.

Noam: I think that all path OS subclasses fit nicely into one module. Most of the logic is in the base class, anyway, and it makes it easier to see what are the differences between each platform.

Mike: Putting code for disparate architectures in one module is asking for trouble. What if one architecture needs to import modules which aren't needed or can't be built on other architectures, especially C modules? Plus the module would become very large due to the need to accommodate Windows's intricacies (e.g., r"C:foo", r"\\uncpath").


Mike: If we use + for path joining, we need a way to create a derived path from a modified filename. Example: "I want to add a prefix or suffix to the filename portion of /a/b/filename." Splitting/rejoining the path is messy, especially if you have to modify the base name but preserve the extensions. No specific API proposal yet.


agreed: extensions are critical, so the class must make it easy to query/modify them without splitting/rejoining the Path. Like directories, extensions have a platform-specific separator. Unlike directories, extensions are conventions rather than OS-enforced rules: not every apparent extension should be treated as such. The user must tell us when to recognize extensions, defined as N number of filename suffixes beginning with the platform's extension separator (.extsep). For instance, most users consider "filename.2005-05-13.tar.gz" and "filename.2006.05.13.tar.gz" as having two extensions each (".tar.gz"), even though the number of apparent extensions is larger.

We can put attributes/methods on the Path object, or on a special str/unicode subclass used for the filename (or for each directory component).

Noam: How should we distinguish between a file with an empty extension ("a.") and a file without an extension ("a")?

Mike: The legacy os.path.splitext() returns ".ext", so it presumably returns "." for an empty extension. We could stick with this. That prevents the ability of treating extensions platform-independently though. I doubt "a." is important enough to support though, have you ever seen it?

Mike: subclassing str is impractical due to the string/unicode duality. Why not path properties: p.ext, p.name (name without extension). The full filename is p[-1] so it doesn't need a property.

Noam: Why does the string/unicode duality make subclassing str impractical? On Windows we can have unicode subclasses, and on POSIX we can have str subclasses. Having extension-related methods added to elements is nice because:

  • Extension is an attribute of a path element, not of the sequence of path elements. (dirs can have extensions just as well)
  • It reduces the number of methods of the path type and makes it easier to distinguish between different kinds of methods.

What should be the interface? Mike said that adding and removing extensions is important. How should it be done?

Mike: There must be a convenient way to add/delete extensions. How about p.add_ext(*exts_without_separator) and p.del_ext(n=1), each returning new Paths. The only other operations then are querying N extensions or splitting the filename into name + N extensions. (Note: if the extsep is attached, an empty string in the result would mean "there are not that many extensions".)
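A rough sketch of what add_ext and del_ext could do, written as plain functions over a filename string rather than as Path methods returning new Paths. The behaviour for names with fewer than n extensions (stop early, leave the name alone) is an assumption, as is the EXTSEP constant.

```python
EXTSEP = "."  # stand-in for the platform's extension separator

def add_ext(filename, *exts_without_separator):
    """Append each extension, inserting the separator in front of it."""
    return filename + "".join(EXTSEP + ext for ext in exts_without_separator)

def del_ext(filename, n=1):
    """Remove up to n trailing extensions; stop early if none are left."""
    for _ in range(n):
        stem, sep, _ext = filename.rpartition(EXTSEP)
        if not sep or not stem:
            break  # no separator, or the name starts with it (e.g. ".bashrc")
        filename = stem
    return filename

print(add_ext("archive", "tar", "gz"))
print(del_ext("archive.tar.gz", 2))
```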


agreed: p.stat() and p.lstat() should return an enhanced version of Python's os.stat() object, with attributes like p.stat().mtime for all information traditionally provided by stat. Include Noam's additional properties from http://wiki.python.org/moin/AlternativePathModule. Do not have Path methods duplicating stat attributes.

Mike: Unlike os.stat(), do not support ugly attributes like .st_mtime or tuple indexing.

Mike: I formerly proposed moving all stat attributes into Path methods, because the distinction between "stat attribute" and "other file info" was arbitrarily defined by Unix tradition, but withdrew this because it's not critical. Having .stat() does let the user cache the result, and having .lstat() avoids the need for a parallel set of methods that don't follow symlinks.
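One possible shape for the enhanced stat object, as a sketch only: a thin wrapper that exposes os.stat() fields without the "st_" prefix and deliberately offers neither .st_mtime nor tuple indexing. The class name StatResult and the module-level stat()/lstat() helpers are hypothetical stand-ins for p.stat() and p.lstat().

```python
import os

class StatResult(object):
    """Wrap an os.stat() result and expose its fields without the 'st_' prefix."""

    def __init__(self, raw):
        self._raw = raw

    def __getattr__(self, name):
        # Only called for attributes not found normally, e.g. .mtime -> raw.st_mtime.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return getattr(self._raw, "st_" + name)
        except AttributeError:
            raise AttributeError(name)

def stat(path):
    return StatResult(os.stat(path))    # follows symlinks

def lstat(path):
    return StatResult(os.lstat(path))   # does not follow symlinks

info = stat(os.curdir)   # one system call; the result can be kept and reused
print(info.mtime, info.size)
```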

Finding files

Jason Orendorff's path module has three methods returning a non-recursive list of paths: listdir, files, dirs; and three methods returning a recursive iteration of paths: walk, walkfiles, walkdirs. Noam proposed combining all these plus filename globbing into one method: glob, with a special pattern "**" meaning "any subdirectory or recursive path of subdirectories".

Nick: Swiss army methods are even more evil than wide APIs. And I consider the term 'glob' itself to be a Unixism - I've found the technique to be far more commonly known as wildcard matching in the Windows world.

Noam: Can you give examples why this proposed method is evil? I think that the basic pattern idea is well defined. It gets three arguments. topdown is, I think, well defined and may be useful. onlyfiles and onlydirs are well defined and are only a convenience. I don't really mind omitting them.

About the name "glob": I have nothing against glob, but if you find another name for the method, I might have nothing against it either.

Jason: Hard-won knowledge here: d.files('*.html') is just right. This is the common use case. glob() overgeneralizes it, forcing me to write d.glob('*.html', filesonly=True). Yuck.

Guido strongly prefers multiple APIs for distinct use cases, as opposed to a single API that serves all the use cases by providing boolean flags that toggle various aspects of its behavior.


I see what you mean. How about "glob" doing what it does in the current proposal, without the "onlyfiles" and "onlydirs" arguments, and "files" and "dirs" getting exactly the same arguments but yielding only files and directories, respectively?

About the "l" versions: Having glob, files, dirs, lglob, lfiles, ldirs seems ugly. Perhaps this should go in as a flag, say, "follow_symlinks=True"? (I would put it after pattern, because remembering the string "topdown" is easier. I don't think of any better name than "follow_symlinks". I also tend to think that it is more useful.)
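A sketch of "files" and "dirs" as separate generators that share a pattern argument and a follow_symlinks flag, written as free functions over os.scandir() and fnmatch from the present-day standard library. The names and argument order mirror the discussion but are not a settled API, and the recursive "**" pattern is not handled here.

```python
import fnmatch
import os

def files(directory, pattern="*", follow_symlinks=True):
    """Yield paths of regular files in directory whose names match pattern."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=follow_symlinks) and fnmatch.fnmatch(entry.name, pattern):
                yield entry.path

def dirs(directory, pattern="*", follow_symlinks=True):
    """Yield paths of subdirectories of directory whose names match pattern."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=follow_symlinks) and fnmatch.fnmatch(entry.name, pattern):
                yield entry.path

for path in files(os.curdir, "*.html"):
    print(path)
```

With follow_symlinks=False a symlink is reported by neither function, which is the gap the proposed *links methods would fill.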

Mike: Non-recursive lists: listdir, files, dirs, symlinks. Recursive iterators: walk, walkfiles, walkdirs, walklinks. All except *links should take a 'symlinks' argument, default True, meaning follow symlinks. If false, never return a symlink. The user can call *links to get the symlinks separately if desired. listdir should have a 'names_only' argument, default False, meaning return the same as os.listdir(). Doesn't a 'pattern' argument eliminate the need for .glob()?

Noam: Can you explain why you think that "listdir, files, dirs, walk, walkfiles, walkdirs" is better than "glob, files, dirs"? I prefer three over six. About "links" methods - Do you have examples of when they are useful? Thinking about it, it seems that dirs+files should cover all the files in the directory, when symlinks are considered directories if they point to directories in the follow_symlinks mode. About the names_only: I don't like an attribute which changes the type of the result. You can always do x[-1] to get the base name.

Mike: Combining the recursive and non-recursive methods is acceptable. They would all have to be generators in that case. .glob() is not the best name: it sounds like something else to Unix people and incomprehensible to non-Unix people. The *links() methods are useful when you want to treat symlinks specially; they eliminate an if-stanza in the main for loop. No reason to shove disparate things into the same loop. If symlinks=True, we do follow the links and inspect the actual directory/file, so we're in agreement. We can drop names_only if we add listdir(). Sometimes you just want the names, and it's a pain (and inefficient) to unpack temporary Path objects made from those same names.

Noam: About glob: Can you suggest a better name? I'm happy with glob but have nothing against a better name.

About listdir: I prefer to omit that method. From my experience, you always want to add the base name to the dir name (what would you do with it otherwise?) I can live with the slight inefficiency and small pain of making a path and taking only the last element on the rare cases in which it's needed. I prefer the "one way to do it" approach here.

About symlinks: I see what you mean. I prefer one iteration with an if stanza, since I then iterate over the contents of a directory only once, but it seems like a reasonable friend of "dirs" and "files". The name "link" is ok, but we should make sure that all symlinks are referred to as "links" in the method names - I don't want to remember when it's a link and when it's a symlink. If so, the "link" method should be renamed "hardlink".

But "lfiles", "ldirs" are so ugly...

Mike: p.listdir() => os.listdir(str(p)) is small, simple, and unobtrusive; it won't bother anybody except purists. Say you need the filenames for a GUI list box or a menu. It's hard to find a name that means ".dirs plus .files"; maybe .walk(recursive=False) is OK. That will surprise existing users of walk functions, but we haven't found a better name. I agree we should be consistent about .(sym)links methods; maybe we should rename .link to .hardlink because it's so rarely used. 'follow_symlinks' as an argument is also acceptable; it's wordy but perhaps better self-explanatory than 'symlinks'. Down with the 'l' versions!
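The ".walk(recursive=False)" idea could look roughly like this: a single generator covering ".dirs plus .files", built on os.walk(). The function is a sketch; the parameter names follow the discussion, and follow_symlinks here only controls whether os.walk() descends into symlinked directories.

```python
import os

def walk(directory, recursive=True, follow_symlinks=True):
    """Yield the path of every entry under directory; direct children only if not recursive."""
    for dirpath, dirnames, filenames in os.walk(directory, followlinks=follow_symlinks):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)
        if not recursive:
            break  # the first os.walk() step covers the top directory only

# Names only, as for a GUI list box: the plain os.listdir() call still does the job.
print(sorted(os.listdir(os.curdir)) == sorted(os.path.basename(p) for p in walk(os.curdir, recursive=False)))
```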


Noam: I removed expand. There's no need to use normpath, so it's equivalent to .expanduser().expandvars(), and I think that the explicit form is better.

Mike: Expand is useful though, so you don't forget one or the other.

Noam: I wouldn't want to call expandvars() by default - I think that expanding environment variables is something that should be done with care, as it may expose info about the environment which should be kept private. Anyway, I think that p.expanduser().expandvars() shows exactly what is being done and isn't a lot longer, so I prefer it.
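In terms of today's os.path functions, the explicit two-step form looks like this. PROJECT is a made-up environment variable set only for the example.

```python
import os
import os.path

os.environ["PROJECT"] = "demo"   # hypothetical variable, set only for this example
raw = "~/work/$PROJECT"

user_only = os.path.expanduser(raw)    # "~" is expanded, $PROJECT is left untouched
both = os.path.expandvars(user_only)   # now $PROJECT is expanded as well

print(user_only)
print(both)
```

Each step is visible at the call site, so a reader can tell whether environment variables are being pulled into the path.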


Mike: Er, not sure I've used it, but it seems useful. Why force people to reinvent the wheel with their own recursive loops that they may get wrong?


Because the handling of exceptional cases is almost always going to be application specific. Note that even os.walk provides a callback hook for if the call to os.listdir() fails when attempting to descend into a directory.
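As an illustration of that hook: os.walk() accepts an onerror callable that receives the OSError raised while listing a directory. The wrapper below is only one application-specific policy (collect errors and keep going); its name is made up.

```python
import os

def walk_collecting_errors(top):
    """Walk top and return (file_paths, errors) instead of silently skipping unreadable dirs."""
    errors = []
    paths = []
    # os.walk() calls onerror with the OSError raised while listing a directory.
    for dirpath, dirnames, filenames in os.walk(top, onerror=errors.append):
        paths.extend(os.path.join(dirpath, name) for name in filenames)
    return paths, errors

paths, errors = walk_collecting_errors(os.curdir)
print(len(paths), "files,", len(errors), "errors")
```

Another application might prefer to re-raise, log, or retry, which is the point being made about application-specific handling.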

For copytree, the issues to be considered are significantly worse:

  • what to do if listdir fails in the source tree?
  • what to do if reading a file fails in the source tree?
  • what to do if a directory doesn't exist in the target tree?
  • what to do if a directory already exists in the target tree?
  • what to do if a file already exists in the target tree?
  • what to do if writing a file fails in the target tree?
  • should the file contents/mode/time be copied to the target tree?
  • what to do with symlinks in the source tree?

Now, what might potentially be genuinely useful is paired walk methods that allowed the following:

   # Do path.walk over this directory, and also return the corresponding
   # information for a destination directory (so the dest dir information
   # probably *won't* match that file system)
   for src_info, dest_info in src_path.pairedwalk(dest_path):
       src_dirpath, src_subdirs, src_files = src_info
       dest_dirpath, dest_subdirs, dest_files = dest_info
       # Do something useful

   # Ditto for path.walkdirs
   for src_dirpath, dest_dirpath in src_path.pairedwalkdirs(dest_path):
       # Do something useful

   # Ditto for path.walkfiles
   for src_file, dest_file in src_path.pairedwalkfiles(dest_path):
       # Do something useful

Jason: I think Python needs high-level APIs to do stuff like copytree(). The current state of affairs is just awful. On Unix I can do os.system('cp ' + ...), but it's not portable.

I haven't tried pairedwalkfiles(), so no opinion.

Mike: .pairedwalk() and friends may be useful. The user wants to know which files/directories to create, update, and delete. So it's essentially a diff report.
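To make the idea easier to judge, here is one possible reading of pairedwalk() as a free function over os.walk(). It follows the interpretation in which the destination information is derived from the source tree and nothing is read from the destination, so it is not yet the diff report described above; the function body is a sketch, not a proposal.

```python
import os

def pairedwalk(src, dest):
    """Yield (src_info, dest_info) pairs; dest_info mirrors src_info under dest."""
    for src_dirpath, subdirs, files in os.walk(src):
        rel = os.path.relpath(src_dirpath, src)
        dest_dirpath = dest if rel == os.curdir else os.path.join(dest, rel)
        # The destination lists are copies of the source lists; dest is never inspected.
        yield (src_dirpath, subdirs, files), (dest_dirpath, list(subdirs), list(files))

for src_info, dest_info in pairedwalk(os.curdir, "backup"):
    print(src_info[0], "->", dest_info[0])
    break  # show only the top-level pair
```

A diff-style variant would additionally list the destination directory and compare, which is where the complexity worries come in.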

Noam: I'm not sure about pairedwalk() - it may be a bit complicated, I'm afraid. However, perhaps copytree() isn't such a big deal if it works only when the source is a directory and the destination doesn't exist. Then, exceptions aren't expected, so if they happen they can simply be propagated.



OK, this is one case where a swiss army method may make sense. Specifically, something like:

  • def copy_to(self, dest, copyfile=True, copymode=True, copytime=False)

Whether or not to copy the file contents, the permission settings and the last access and modification time are then all independently selectable.

The different method name also makes the direction of the copying clear (with a bare 'copy', it's slightly ambiguous as the 'cp src dest' parallel isn't as strong as it is with a function).

Noam: I think the different name and arguments are a good idea. What exactly does the copyfile argument mean?

Jason: Definitely agree with Nick.

Noam: What about copyto? It's easier to write, I think that it's not hard to understand, and perhaps it focuses less attention on the "to", making it look like a special kind of copy.

Mike: src.copy(dest, content=True, mode=False, time=False). .copy_to is OK, .copyto is bad. Almost everybody expects .copy to mean .copy_to and not .copy_from.

Noam: Please, what does the content/copyfile argument mean? About copymode vs. mode: I prefer copymode. "mode" seems like a mode specification (like in mkdir), not like a boolean.

Mike: 'content' means copy the file contents. If content=False and the destination doesn't exist, create an empty file. If content=False and the destination does exist, copy the file attributes only but don't modify the content. This covers all copying use cases.

Noam: Why should you want to copy only attributes? Can you give an example?

Mike: When you want to make the perms/mtime of one file match another file. When you want to create an empty log file with the same perms/mtime as another file, because the logging program will have modify permission but not create permission.
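The semantics described for content/mode/time could be sketched with shutil and os as below. The function name copy_to and the defaults follow the signatures floated in this thread; the implementation itself is an assumption about how the pieces would fit together.

```python
import os
import shutil

def copy_to(src, dest, content=True, mode=False, time=False):
    """Copy src to dest; contents, permission bits and timestamps are selectable."""
    if content:
        shutil.copyfile(src, dest)
    elif not os.path.exists(dest):
        open(dest, "wb").close()   # content=False and no destination: create an empty file
    if mode:
        shutil.copymode(src, dest)  # permission bits only
    if time:
        info = os.stat(src)
        os.utime(dest, (info.st_atime, info.st_mtime))  # access and modification times
```

For the log-file use case one would call copy_to("old.log", "new.log", content=False, mode=True, time=True), with both file names being placeholders.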


Noam: Someone with experience with unicode filenames, please help!

Jason: I have some experience, not a ton.

In the Win32 API, paths are Unicode strings. To produce a path-string you'll have to decode any non-Unicode strings in your tuple; Python's default encoding is one option, but the operating system's default encoding is another option; I think the latter is what the os functions do on Windows.

In the POSIX API, paths are char strings, which means 8-bit strings on every platform I'm familiar with. The character set varies from system to system. Some use UTF-8.

It's kind of squirrely if you allow both 8-bit strings and Unicode strings in your tuple. I suggest using only Unicode within the tuple and converting to 8-bit only as needed to talk to POSIX.


Thanks for the explanation. I agree about not mixing different kinds of strings. Is there a good way to convert unicode strings into file names on POSIX? How do you know the right encoding?
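For what it's worth, later Python 3 releases answer this with sys.getfilesystemencoding() plus the helpers os.fsdecode() and os.fsencode(); the helpers did not exist when this discussion took place, so the snippet only shows where the question ended up. The byte string is an arbitrary example name.

```python
import os
import sys

raw = b"r\xc3\xa9sum\xc3\xa9.txt"   # a file name as POSIX hands it over (UTF-8 bytes here)

print(sys.getfilesystemencoding())  # the encoding Python uses for file names on this system
text = os.fsdecode(raw)             # bytes -> str using that encoding
back = os.fsencode(text)            # str -> bytes, recovering the original bytes
print(back == raw)
```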

Mike: At first I thought about forcing everything to Unicode on input and adding 'encoding' and 'onerror' arguments to the constructor. That doesn't solve the problem of choosing the charset to encode on output. But now I'm wondering if we should just preserve whatever type(s) the user inputs.

Noam: I don't think that preserving the type of the user input will work: You'll still have to encode it to str on POSIX. It seems to me that the only solution is to use the native "alphabet" of the system: Unicode chars on Windows, and byte chars on POSIX. To put it more clearly: All elements on Windows will be unicode, all elements on POSIX will be str.

Obsoleting other modules


I don't believe it's a given that a nice path object will obsolete the low level operations. When translating a shell script to Python (or vice versa), having access to the comparable low level operations would be of benefit.

At most, I would expect provision of an OO path API to result in a comment in the documentation of various modules (os.path, shutil, fnmatch, glob) saying that "pathlib.Path" (or whatever it ends up being called) is generally a more convenient API.

Noam: I don't mind obsoleting os.path, shutil, fnmatch, glob, as I see them as high-level operations. I don't mind not obsoleting them either - it may keep the code more organized if different operations are in different modules. I agree that most of the functions in the os module shouldn't be obsoleted - these are really low-level operating system operations, and you shouldn't need to use a complex path object in order to call them.

Jason: The new API should be the one high-level API for this type of stuff. All the other high-level APIs should be obsoleted.

Mike: We cannot deprecate the existing functions in Python 2.x; too many existing programs would break. But we can discourage them in the documentation.

Additional methods/attributes


Mike: Delete "it" recursively if it exists, whatever it is. This is convenient when you don't care whether it's a file or directory, you just want to overwrite it, and you don't want to take six lines of code to do it.

Noam: Why six lines of code? I count four:

if p.isfile():
    p.remove()
elif p.isdir():
    p.rmtree()

We can have rmtree work also for files, and even for non-existing paths, but I'm not sure it's a good idea.

Mike: .rmtree would go away if .purge is added. So we'd have to inline its implementation. The main reason for .purge is .rmtree raises exceptions if (A) the Path is a file, or (B) the Path doesn't exist, and you don't want to clutter your code for all those cases when you just want to write or remove "it".

Noam: I feel fine with the four lines above, but I can live with another method. We can bring this to python-dev decision.

Mike: Adding the two capabilities to .rmtree would be functionally the same. I think .purge is a better name though.
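A sketch of the .purge behaviour as a free function over os and shutil: it removes whatever is at the path and stays silent if nothing is there. Treating a symlink to a directory as a link to remove, rather than a tree to delete, is an assumption not settled above.

```python
import os
import shutil

def purge(path):
    """Delete path if it exists, whether it is a file, a symlink, or a directory tree."""
    if os.path.isdir(path) and not os.path.islink(path):
        shutil.rmtree(path)
    elif os.path.lexists(path):
        os.remove(path)   # regular file, or a symlink (even a dangling one)
    # Nothing there: succeed silently.
```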


Mike: These should succeed silently if the operation is already done. Otherwise the user has to write an unnecessary "if p.exists():" around it. If the user really cares whether the item exists, he can explicitly write the if-stanza. If not, he shouldn't be forced to clutter his code, especially since that obscures whether it does matter or not that the item existed.

AlternativePathDiscussion (last edited 2008-11-15 13:59:41 by localhost)
