Python Quick Guide

Get Help

>>> help(str)
Help on class str in module builtins:

>>> help(str.find)
Help on method_descriptor:

find(...)
    S.find(sub[, start[, end]]) -> int

    Return the lowest index in S where substring sub is found,
    such that sub is contained within S[start:end].  Optional
    arguments start and end are interpreted as in slice notation.

    Return -1 on failure.

Collections

Lists, Tuples, Sets and Dictionaries

Summary

  • Lists: containers to hold multiple elements in order
  • Tuples: similar to lists, but immutable
  • Sets: containers to hold multiple elements when membership, rather than order or position, is important
  • Dictionaries: key-value pairs

List highlights

# a list can hold elements of different types
>>> x = [1, 2, 3, "abc", [4, 5]]

# slicing is a widely used operation
# [start index: end index: step]
>>> x[3:1:-1]
['abc', 3]
# perform in-place modification with slicing
>>> x = [1, 2, 3, "abc", [4, 5]]
>>> x[3:] = [4]
>>> x
[1, 2, 3, 4]
# in-place 'filtering'
>>> x[:] = [e for e in x if e % 2 == 0]
>>> x
[2, 4]


# in-place sort vs. returning a sorted list
# in-place sort
>>> countries = ["China", "USA", "Australia"]
>>> countries.sort(key=lambda x: len(x))
>>> countries
['USA', 'China', 'Australia']
# sorted built-in function returns a sorted list
>>> countries = ["China", "USA", "Australia"]
>>> sorted(countries, key=lambda x: len(x))
['USA', 'China', 'Australia']

# shallow copy vs. deep copy
>>> l1 = [["x"], "y"]
# shallow copy via slicing
>>> l1_sc = l1[:]
# deep copy
>>> import copy
>>> l1_dc = copy.deepcopy(l1)
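
Mutating the nested list (continuing the session above) shows the difference: the shallow copy shares the inner list, while the deep copy does not.

>>> l1[0].append("z")
>>> l1_sc
[['x', 'z'], 'y']
>>> l1_dc
[['x'], 'y']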

Tuple highlights

# `,` is needed for single element tuple
>>> type((1,))
<class 'tuple'>
>>> type((1))
<class 'int'>

# a tuple is immutable, but not necessarily hashable
>>> x = (1,2,[3])
>>> type(x)
<class 'tuple'>
# tuple itself is immutable, but its content may be mutable
>>> x[2].extend([4,5])
>>> x
(1, 2, [3, 4, 5])

# swap variable values with tuple and packing/unpacking
>>> x = 5
>>> y = 23
>>> x,y = y,x # y,x is packed into a tuple and then unpacked for assignment
>>> x
23
>>> y
5

Set highlights

# items in a set must be hashable (which usually means immutable)
>>> set((1,2,3))
{1, 2, 3}
>>> set((1,2,[3]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

# duplicate items are removed when adding to set
>>> s = {1,2,3,4,5,2,3}
>>> s
{1, 2, 3, 4, 5}
>>> s.add(5)
>>> s
{1, 2, 3, 4, 5}

# a set itself is mutable and therefore not hashable
# to put a set inside another set, use frozenset
>>> {1,2,3,{4,5}}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'set'
>>> {1,2,3,frozenset({4,5})}
{frozenset({4, 5}), 1, 2, 3}

Dictionary highlights

# widely used `items` function
>>> for k,v in {"China": 5, "USA": 3}.items():
...     print(f"{k} --> {v}")
...
China --> 5
USA --> 3

# to delete an entry, use `del`
>>> d = {"China":5, "USA":3}
>>> del d["USA"]
>>> d
{'China': 5}

# provide default value when the key does NOT exist in the dict
# `dict.get(key, dflt_val)`
# `dict.setdefault(key, dflt_val)`
>>> d = {"China":5, "USA":3}
>>> d.get("Japan", 5)
5
>>> d.setdefault("Korea", 5)
5
>>> d["Korea"]
5
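
Note the difference between the two (continuing the session above): get only returns the default, while setdefault also inserts the key.

>>> "Japan" in d
False
>>> "Korea" in d
True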

Dictionaries can be used as caches to avoid recalculation

cal_cache = {}

# `calculate` stands for an expensive computation defined elsewhere
def calc(param):
    if param not in cal_cache:
        # calculate and then store the result into the cache
        result = calculate(param)
        cal_cache[param] = result
    return cal_cache[param]
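
For functions whose arguments are hashable, the standard library's functools.lru_cache decorator offers the same memoization without a hand-written dictionary; a minimal sketch (fib is just an illustrative function):

import functools

@functools.lru_cache(maxsize=None)  # cache every result
def fib(n):
    # results for already-seen `n` come straight from the cache
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))  # 832040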

Comprehension

Don’t loop if a comprehension can do it cleaner.

# list comprehension
>>> [e*e for e in [1,2,3]]
[1, 4, 9]

# set comprehension
>>> {e*e for e in {1,2,3}}
{1, 4, 9}

# dict comprehension
>>> {k.upper() : v*2 for k, v in {"a":1, "b":2}.items()}
{'A': 2, 'B': 4}
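
A comprehension can also filter with an if clause:

# comprehension with a condition
>>> [e*e for e in [1,2,3,4] if e % 2 == 0]
[4, 16]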

Strings

Strings can be treated as sequences of chars, so operations like slicing can be performed on strings.

>>> "Hello"[-1::-1]
'olleH'

Numeric and Unicode escape sequences can be used to represent characters in strings.

>>> "\x6D"
'm'
>>> "\u2713"
'✓'
>>> '\u4F60\u597D'
'你好'

Strings are immutable, so string methods return new strings; they only look like they update the string contents in place.

>>> "hello, world".title()
'Hello, World'

>>> "C++++".replace("++","+")
'C++'

The string module defines some useful constants.

>>> import string
>>> string.digits
'0123456789'
>>> string.hexdigits
'0123456789abcdefABCDEF'
>>> string.whitespace
' \t\n\r\x0b\x0c'
>>> string.ascii_lowercase
'abcdefghijklmnopqrstuvwxyz'

Formal string representation vs. Informal string representation

  • repr: formal string representation of a Python object. The returned string is meant to be unambiguous and, where possible, could be used to rebuild the original object, a bit like serialization/deserialization. It’s great for debugging programs.
  • str: informal string representation of a Python object. It’s intended to be read by humans. If an object doesn’t define its own __str__, str falls back to repr, which is why both calls below give the same result.
>>> repr([1,2,3])
'[1, 2, 3]'
>>> str([1,2,3])
'[1, 2, 3]'
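
For strings, however, the two differ: repr adds quotes (so the result could be pasted back into code), while str does not.

>>> repr("hello")
"'hello'"
>>> str("hello")
'hello'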

String interpolation, known as f-strings, has been available since Python 3.6.

>>> value = 523
>>> f"The value: {value}"
'The value: 523'

# any expression, including a function call, can be evaluated inside the braces
>>> lang = "go"
>>> f"The next one: {lang.upper()}"
'The next one: GO'
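
Format specifiers can be placed after a colon inside the braces:

# format specifiers work inside the braces
>>> import math
>>> f"pi is roughly {math.pi:.2f}"
'pi is roughly 3.14'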

Bytes

String vs. Bytes

  • A string object is an immutable sequence of Unicode characters.
  • A bytes object is an immutable sequence of integers with values from 0 to 255, mainly for dealing with binary data.

Two commonly confused terms

  • unicode: a character set, which assigns a code point to each character
  • utf-8: an encoding standard used to represent Unicode characters as bytes. With different encodings, the same character is represented by different byte values.
>>> c = "\u2713"
>>> c
'✓'

# try different encoding
>>> c.encode(encoding='utf-16')
b"\xff\xfe\x13'"
>>> c.encode(encoding='utf-8')
b'\xe2\x9c\x93'

# decode the encoded value back to a string; utf-8 is the default encoding
>>> b'\xe2\x9c\x93'.decode()
'✓'

Control Flow

The if/elif/else ‘ladder’ structure looks like this

if condition1:
   body1
elif condition2:
   body2
elif condition3:
   body3
...
elif condition(n-1):
   body(n-1)
else:
   body(n)

pass can be used if an empty body for if or else is needed (a comment alone is not a valid body).

if cond:
    pass
else:
    do_something_else()  # placeholder for the real work

A dictionary can be used to ease the ’ladder’ structure.

def take_action_a():
    print("doing action a")

def take_action_b():
    print("doing action b")

def take_action_c():
    print("doing action c")

func_dict = {'a': take_action_a,
             'b': take_action_b,
             'c': take_action_c}

# populate the desired function key; here 'a' is simply assigned for demo purposes
desired_func_key = 'a'
func_dict[desired_func_key]()
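
If the key may be missing, dict.get with a default function avoids a KeyError; a small sketch continuing the example above (take_action_default is a hypothetical fallback):

# fall back to a default action for unknown keys
def take_action_default():
    print("no matching action")

func_dict.get('z', take_action_default)()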

for Loop

The for loop is different from the one in ‘C family’ programming languages. In Python, for iterates over the values produced by any iterable object, so it behaves more like a for-each construct than a counting loop.

>>> for elt in [1,2,3,4,5]:
...     if elt % 2 == 0:
...         print(elt)
...
2
4

Unpacking is supported by for.

>>> for idx, val in enumerate(["A", "B", "C"]):
...     print(f"{idx}: {val}")
...
0: A
1: B
2: C
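
zip pairs up multiple iterables, and the pairs unpack the same way:

>>> for name, score in zip(["A", "B"], [90, 80]):
...     print(f"{name}: {score}")
...
A: 90
B: 80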

range, Generator and Memory Usage

When dealing with lists holding a large number of elements, memory usage can become an issue. Compare the memory consumption of a list and a range.

>>> import sys

>>> sys.getsizeof(list(range(1000000)))
8000056

>>> sys.getsizeof(range(1000000))
48

So using a range or a generator can reduce the strain on memory.

>>> x = list(range(1_000_000))
# using generator expression, we don't have to 'duplicate' the size of `x`
>>> g = (elt * elt for elt in x)
>>> import sys
>>> sys.getsizeof(g)
104
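
The generator can then be consumed lazily, e.g. by sum, without ever materializing the list of squares (a smaller range keeps the output short):

>>> sum(elt * elt for elt in range(1_000))
332833500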

Boolean Values for Conditions

In Python

  • Zero and empty values evaluate to False.
  • Any other value evaluates to True.

Some practical examples

  • Values like 0.0 and 0+0j are False.
  • Empty String "" is False.
  • Empty list [] is False.
  • Empty dictionary {} is False.
  • The special value None is always False.

Some objects, such as file objects and code objects, don’t have a sensible notion of zero or emptiness, so they should NOT be used in a Boolean context.

Some Boolean-related operators

  • in and not in to test membership (example below)
  • is and is not to test identity
  • and, or, and not to combine Boolean values
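
For example, membership tests work on any container or string:

>>> 3 in [1, 2, 3]
True
>>> "z" not in "xyz"
False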

Operators

==/!= vs. is/is not

Equality vs. Identity

  • ==/!=: to test the equality
  • is/is not: to test the identity
>>> l1 = [1,2,3]
>>> l2 = [1,2,3]
>>> l1 == l2
True
>>> l1 is l2
False

and and or Used in Non-Boolean Context

and and or can be used in a non-Boolean context to ‘pick’ an object.

  • and: returns the first falsy operand, or the last operand if all are truthy
  • or: returns the first truthy operand, or the last operand if all are falsy
>>> "a" and "" and "c"
''
>>> "a" and "b" and "c"
'c'

>>> "a" or "" or "c"
'a'
>>> "" or "" or "c"
'c'

Alternative to Ternary Operator ? :

Some programming languages provide the ternary operator ? :, as in the JavaScript snippet below

name = 1 ? "Yang" : "Yin"
console.log(name)

However, there is NO such ternary operator in Python. Python instead offers a more readable conditional expression

>>> name = "Yang" if 1 else "Yin"
>>> print(name)
Yang

Functions

The basic function definition is like below

>>> def double(x):
...     return x * 2
...

>>> double(5)
10

# function without parameters
>>> def subroutine():
...     print("This is subroutine")
...
>>> subroutine()
This is subroutine

# function without explicit return
# in this case `None` is returned
>>> def no_explicit_return():
...     print("No explicit_return")
...
>>> r = no_explicit_return()
No explicit_return
>>> r is None
True

Parameters

Three available options for function parameters

  • Positional parameters
  • Named parameters
  • Variable numbers of parameters

Named parameters help remove the ambiguity in some cases

>>> def power(base, exponential):
...     if exponential == 0:
...         return 1
...     else:
...         return base * power(base, exponential-1)
...

# using named parameters, we know it's the cube of 2, not the square of 3
>>> power(base = 2, exponential = 3)
8

In addition, named parameters help when default values are involved

>>> def greet(message="Hello", name="world"):
...     print(f"{message}, {name}")
...

# use the default value of `message` parameter
>>> greet(name="NYC")
Hello, NYC

Variable numbers of parameters allow a function to handle an arbitrary number of arguments. There is no method overloading in Python like the one in Java, and variable numbers of parameters can be used to mimic that feature. In addition, the decorator pattern can be implemented with variable numbers of parameters.

def decorate(fn):
    def decorated_fn(*parameters, **key_val_pairs):
        print("Doing decoration tasks...")
        fn(*parameters, **key_val_pairs)
        print("End\n")
    return decorated_fn

def greet(message="Hello", name="world"):
    print(f"{message}, {name}")

decorated_greet = decorate(greet)
decorated_greet("Hi")
decorated_greet(name="NYC")
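
The same wrapper can also be applied with the @ decorator syntax, which is equivalent to greet = decorate(greet); a brief sketch:

@decorate
def greet(message="Hello", name="world"):
    print(f"{message}, {name}")

greet("Hi")  # now runs the decoration tasks around the original greet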

Functions as the First-class Citizens

Functions can be assigned to variables, just as other values in Python.

>>> def foo():
...     print("foo function!")
...
>>> fn = foo
>>> fn()
foo function!

Anonymous functions are implemented as lambda expressions.

>>> fn = lambda: print("bar function!")
>>> fn()
bar function!
>>> fn
<function <lambda> at 0x000002E3BDD3EB90>

Higher-order functions are supported ’natively’, since functions are first-class citizens.

# a function can accept functions as arguments and return a function
def combine(outer_fn, inner_fn):
    def combined_fn(*parameters, **key_values):
        return outer_fn(inner_fn(*parameters, **key_values))
    return combined_fn

def square(x):
    return x * x

dbl = lambda x: x * 2

double_of_squared = combine(dbl, square)
r = double_of_squared(5)
print(r)  # 50
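
The built-in higher-order functions map and filter follow the same idea:

# map/filter take a function as their first argument
print(list(map(square, [1, 2, 3])))              # [1, 4, 9]
print(list(filter(lambda x: x > 1, [1, 2, 3])))  # [2, 3]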

Scope: global and nonlocal

Local variables vs. global variables vs. nonlocal variables

  • local variables: variables defined inside the function
  • global variables: variables defined outside the function, at module level
  • nonlocal variables: variables defined in the ’enclosing’ function’s scope

Compare local variables and global variables.

a = 10

def foo():
    # global a
    a = 20
    print(f"a in foo: {a}")

foo()
print(f"global a: {a}")

# we get below result
#   a in foo: 20
#   global a: 10

# when `global a` is uncommented, we get below result
#   a in foo: 20
#   global a: 20

nonlocal refers to the variable defined in the enclosing function.

a = 10

def foo():
    a = 20
    def bar():
        nonlocal a
        a = 30
        print(f"a in bar: {a}")
    bar()
    print(f"a in foo: {a}")

foo()
print(f"global a: {a}")

# we get below result
#   a in bar: 30
#   a in foo: 30
#   global a: 10

Generator Functions

Besides generator expressions, generator functions also help reduce memory usage.

# generator function
def gen_1_m():
    i = 1
    while i < 1_000_000:
        yield i
        i = i + 1

s = 0
for elt in gen_1_m():
    s = s + elt

print(s)

yield from can be used to delegate from one generator to another generator or iterable.

g1 = range(1,500_000)
g2 = range(500_000,1_000_000)

def gen_1_m():
    yield from g1
    yield from g2

s = 0
for elt in gen_1_m():
    s = s + elt
print(s)

Modules and Scoping Rules

What is a module?

  • a file containing Python code, which defines Python functions or objects
  • the name of the file defines the name of the module

Why use modules?

  • for better organizing source code
  • modules help avert name clashes. Suppose two people both define a greet function:
    • module_a.greet
    • module_b.greet

To use a module, import it first.

# import the built-in `math` module
>>> import math

# check the members of the module
>>> dir(math)
['__doc__', '__loader__', '__name__', '__package__', '__spec__', 'acos', 'acosh', 'asin', 'asinh', 'atan', 'atan2', 'atanh', 'ceil', 'comb', 'copysign', 'cos', 'cosh', 'degrees', 'dist', 'e', 'erf', 'erfc', 'exp', 'expm1', 'fabs', 'factorial', 'floor', 'fmod', 'frexp', 'fsum', 'gamma', 'gcd', 'hypot', 'inf', 'isclose', 'isfinite', 'isinf', 'isnan', 'isqrt', 'lcm', 'ldexp', 'lgamma', 'log', 'log10', 'log1p', 'log2', 'modf', 'nan', 'nextafter', 'perm', 'pi', 'pow', 'prod', 'radians', 'remainder', 'sin', 'sinh', 'sqrt', 'tan', 'tanh', 'tau', 'trunc', 'ulp']

# reference `pi` defined in `math`
>>> math.pi
3.141592653589793

Another import form is from <module> import <member/*>

>>> from math import pi
>>> pi
3.141592653589793

# we can even import all members using `*` (generally discouraged, as it clutters the namespace)
>>> from math import *
>>> gcd
<built-in function gcd>
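
Modules and members can also be imported under an alias with as:

>>> import math as m
>>> m.pi
3.141592653589793
>>> from math import factorial as fact
>>> fact(5)
120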

The Module Search Path

To make module files available for import, put them under any of the path entries defined in sys.path

>>> import sys
>>> sys.path
['', 'C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python310\\python310.zip', 'C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python310\\DLLs', 'C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python310\\lib', 'C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python310', 'C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages']

Note

  • The first module file found in the entries is used.
  • If no desired module can be found, an ImportError exception is raised.

How are the path entries in the sys.path list defined?

  • The sys.path list is initialized from the PYTHONPATH environment variable if it exists.
  • A .pth file listing extra path entries can be placed under the directory defined by sys.prefix.
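
The list can also be extended at runtime with sys.path.append (the directory below is only an example):

# append an extra directory at runtime
>>> import sys
>>> sys.path.append("c:/sandbox/PythonLab/my_modules")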

Scoping Rules and Namespaces

A namespace maintains the mapping from identifiers to objects. A statement like x = 1 adds x to a namespace and associates x with the value 1.

In Python there are three namespaces

  • local: holding local functions and variables
  • global: holding module functions and module variables
  • built-in: holding built-in functions

When Python needs to ’locate’ an identifier, it follows the sequence below

  1. Check the local namespace.
  2. If the identifier doesn’t exist in the local namespace, check the global namespace.
  3. If it still isn’t found, check the built-in namespace.
  4. If the identifier doesn’t exist in any of the above, a NameError is raised.

When a function call is made, a local namespace is created.

def foo():
    x = 1
    print(f"In foo locals: {locals()}")
    print(f"In foo globals: {globals()}")

y = 2
foo()
# at module level, locals() is equivalent to globals()
print(locals() == globals())
print(dir(__builtins__))

# executing above code snippet, we get
In foo locals: {'x': 1}
In foo globals: {'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <_frozen_importlib_external.SourceFileLoader object at 0x0000013F1797C700>, '__spec__': None, '__annotations__': {}, '__builtins__': <module 'builtins' (built-in)>, '__file__': 'C:\\sandbox\\PythonLab\\Scripts\\lab.py', '__cached__': None, 'foo': <function foo at 0x0000013F178B3E20>, 'y': 2}
True
['ArithmeticError', 'AssertionError', 'AttributeError', 'BaseException', 'BlockingIOError', 'BrokenPipeError', 'BufferError', 'BytesWarning', 'ChildProcessError', 'ConnectionAbortedError', 'ConnectionError', 'ConnectionRefusedError', 'ConnectionResetError', 'DeprecationWarning', 'EOFError', 'Ellipsis', 'EncodingWarning', 'EnvironmentError', 'Exception', 'False', 'FileExistsError', 'FileNotFoundError', 'FloatingPointError', 'FutureWarning', 'GeneratorExit', 'IOError', 'ImportError', 'ImportWarning', 'IndentationError', 'IndexError', 'InterruptedError', 'IsADirectoryError', 'KeyError', 'KeyboardInterrupt', 'LookupError', 'MemoryError', 'ModuleNotFoundError', 'NameError', 'None', 'NotADirectoryError', 'NotImplemented', 'NotImplementedError', 'OSError', 'OverflowError', 'PendingDeprecationWarning', 'PermissionError', 'ProcessLookupError', 'RecursionError', 'ReferenceError', 'ResourceWarning', 'RuntimeError', 'RuntimeWarning', 'StopAsyncIteration', 'StopIteration', 'SyntaxError', 'SyntaxWarning', 'SystemError', 'SystemExit', 'TabError', 'TimeoutError', 'True', 'TypeError', 'UnboundLocalError', 'UnicodeDecodeError', 'UnicodeEncodeError', 'UnicodeError', 'UnicodeTranslateError', 'UnicodeWarning', 'UserWarning', 'ValueError', 'Warning', 'WindowsError', 'ZeroDivisionError', '__build_class__', '__debug__', '__doc__', '__import__', '__loader__', '__name__', '__package__', '__spec__', 'abs', 'aiter', 'all', 'anext', 'any', 'ascii', 'bin', 'bool', 'breakpoint', 'bytearray', 'bytes', 'callable', 'chr', 'classmethod', 'compile', 'complex', 'copyright', 'credits', 'delattr', 'dict', 'dir', 'divmod', 'enumerate', 'eval', 'exec', 'exit', 'filter', 'float', 'format', 'frozenset', 'getattr', 'globals', 'hasattr', 'hash', 'help', 'hex', 'id', 'input', 'int', 'isinstance', 'issubclass', 'iter', 'len', 'license', 'list', 'locals', 'map', 'max', 'memoryview', 'min', 'next', 'object', 'oct', 'open', 'ord', 'pow', 'print', 'property', 'quit', 'range', 'repr', 'reversed', 'round', 'set', 'setattr', 'slice', 'sorted', 'staticmethod', 'str', 'sum', 'super', 'tuple', 'type', 'vars', 'zip']

Interaction between Python Program and System

Combine Script and Module

A Python program can be treated as a script or a module depending on the execution context. The structure below does the trick.

if __name__ == '__main__':
    main()
else:
    pass  # module-specific initialization code, if needed

When the Python file is executed as a script, its __name__ is set to '__main__'; when it is imported as a module, __name__ is the module’s name.

Commandline Arguments

The arguments passed from commandline can be retrieved via sys.argv.

import sys

def main():
    print(sys.argv)

main()

sys.argv is a list

  • The first element is the name of the script file.
  • The following elements are the arguments passed from commandline.
PS C:\sandbox\PythonLab\TempLab> python .\my_script.py Hello World "Test Script"
['.\\my_script.py', 'Hello', 'World', 'Test Script']

# omit `.\` to invoke the script file
PS C:\sandbox\PythonLab\TempLab> python my_script.py Hello World "Test Script"
['my_script.py', 'Hello', 'World', 'Test Script']

Use the argparse module if more advanced features are needed to handle commandline arguments.
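
A minimal argparse sketch (the argument names and the file name demo_args.py are only for illustration):

import argparse

parser = argparse.ArgumentParser(description="argparse demo")
parser.add_argument("name")                           # positional argument
parser.add_argument("--repeat", type=int, default=1)  # optional argument
args = parser.parse_args()

for _ in range(args.repeat):
    print(f"Hello, {args.name}")

# e.g.: python demo_args.py NYC --repeat 2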

Filesystem Interaction

File Paths

Path related modules

  • os.path: the traditional module, imperative (function-based) style
  • pathlib: added in Python 3.4, OO style

os.path provides a useful abstraction layer to ease operations on filesystems. For example, the file path separator differs from OS to OS.

  • \ in Windows OS
  • / in *nix OS

Using os.path.sep, we don’t have to worry about the difference. Programming against that abstraction layer means, for example,

  • use os.path.curdir
  • NOT the literal .

Unfortunately, there is no unified concept of a root path. Think about the path types we have in Windows OS

  • C:\ means the C drive
  • \\myftp\share\ means a UNC root path

As a result, we do NOT have something like os.path.root in Python.

To form a path

import os
import pathlib

# form a path with os.path
os.path.join("c:/", "Sandbox", "Temp")

# form a path with pathlib
# note `joinpath` of the Path object is an instance method
pathlib.Path().joinpath("c:/", "Sandbox", "Temp")
pathlib.Path() / "c:/" / "Sandbox" / "Temp"
pathlib.Path("c:/") / "Sandbox" / "Temp"

Filesystem Operations

Filesystem operations are performed via the os module. Don’t confuse it with the sys module; think of ‘sys’ as the ‘Python system’.

import os

# change directory
os.chdir("My Target Dir")

# print current working directory
os.getcwd()

# list items in the directory
# note on Windows we may encounter PermissionError if the dir is read-only
os.listdir(os.path.curdir)

# get file/dir info
os.path.exists(path_as_arg)
os.path.isfile(path_as_arg)
os.path.isdir(path_as_arg)
os.path.getsize(path_as_arg)
os.path.getatime(path_as_arg)

# rename a file/dir
os.rename("original", "target")

# remove a file
# `remove` cannot remove a directory
os.remove("file_to_be_removed")
# `rmdir` can remove an empty directory
os.rmdir("empty_dir_to_be_removed")

# create a directory
os.mkdir("dir_name")
# `makedirs` creates intermediate directories automatically
os.makedirs("auto_create_intermediate_dirs")

If the OO style is preferred, use the pathlib module. With pathlib we create objects representing individual paths, so there is no operation like pathlib_obj.chdir("Target Dir")

import pathlib

# create the obj representing the current dir
curr_dir = pathlib.Path()

# create the obj representing the specified path
root_dir = pathlib.Path("/")

# list the items in the directory
for fs_item in curr_dir.iterdir():
    print(fs_item) # fs_item is Path object as well

# print current working directory
# the two expressions below return the same value: cwd() reports the
# process's current working directory (where the program was started,
# or wherever we chdir'ed to later), not the path the object represents
curr_dir.cwd()
root_dir.cwd()

# get file/dir info
path_obj.exists()
path_obj.is_file()
path_obj.is_dir()
path_obj.stat()

# rename a file or directory
path_obj.rename("new_name")

# remove a file
path_obj.unlink()
# remove an empty directory
path_obj.rmdir()

# create a directory
path_obj.mkdir() # requires intermediate directories exist
path_obj.mkdir(parents=True) # intermediate directories will be created automatically

Utilities for Filesystem Operation

os.scandir provides an easy way to get metadata for the filesystem entries under a directory.

import os

# use a context manager to ensure the file descriptor is released
# regardless of whether the iterator is fully iterated
with os.scandir(os.curdir) as my_dir:
    for fs_entry in my_dir:
        print(f"{fs_entry.name}: {fs_entry.stat()}")

glob.glob provides the globbing functionality.

import glob
import os

os.chdir("c:/sandbox/pythonlab/scripts")
py_files = glob.glob("*.py")

for py_file in py_files:
    print(f"Python File: {py_file}")

shutil.rmtree can remove a non-empty directory, and shutil.copytree can recursively copy all the files and subdirectories in a given directory.

import shutil

shutil.rmtree(nonempty_dir_to_be_removed)

shutil.copytree(src, dst)

os.walk(directory, topdown=True, onerror=None, followlinks=False) traverses a directory structure recursively. For each directory it visits, it yields three things

  • the path of the directory currently being visited (the ‘root’ of this step)
  • a list of its subdirectories (each of which will be visited in turn)
  • a list of its files
for root, subdirs, files in os.walk("Test"):
    for file in files:
        print(f"file name: {file}")
    # remove backup directory from the recursion
    subdirs[:] = [e for e in subdirs if e != "backup"]
    print(f"Subdir list now is {subdirs}")

Note

  • If topdown is True or not given, each directory is yielded before its subdirectories. That means we have a chance to remove some subdirectories, such as .git/ or .config/, from the recursion.

File I/O

Open and Close Files

The classic open-process-close file operation is like below

file_obj = open("c:/temp/hello.txt")
print(file_obj.readline())
file_obj.close()
print(f"File closed? {file_obj.closed}")

Using context managers, we don’t need to explicitly close the file.

with open("c:/temp/hello.txt") as file_obj:
    print(file_obj.readline())

Specify the mode to open the file with

  • r: read mode, the default mode
  • w: write mode, existing data in the file is truncated before writing
  • a: append mode, new data is appended to the end of the file
  • x: exclusive-creation mode, it raises FileExistsError if the file already exists
  • +: read and write mode
  • t: text mode, the default mode
  • b: binary mode, data is read and written as bytes

Combining the above modes, we have

  • rt: read as text
  • w+b: read/write the file in binary mode, truncating it first
  • r+b: read/write the file in binary mode without truncating it

In addition, pay attention to the options below when opening a file

  • encoding: specify the encoding to open the file with
  • newline: controls newline translation, since different operating systems use different newline characters

Suppose we have a text file containing the Unicode characters below, saved with utf-8 encoding

✓💓🍁

We can specify the encoding as utf-8 when opening the file

with open("c:/temp/unicode.txt", encoding="utf-8") as file_obj:
    print(file_obj.read(1))
    print(file_obj.read(1))
    print(file_obj.read(1))

Read and Write with TextIOWrapper

In most cases, read, readline and readlines without arguments are good enough for file reading. However, there are exceptional cases, such as when

  • the file is too large
  • a single line is very long
  • there are too many lines

Two approaches to tackle the issue

  • provide an argument to limit the amount of data read each time
  • use the iterator protocol to lazily load the file contents
# argument to affect the amount of data being read every time
size_to_read = 50
with open("c:/temp/the_zen_of_python.txt", mode="rt") as file_obj:
    while sized_content := file_obj.read(size_to_read):
        print(sized_content, end='')

# treat the file object as an iterator
# `open` returns a file object which is an iterator
# `isinstance(fo, collections.abc.Iterator)` returns True
with open("c:/temp/the_zen_of_python.txt", mode="rt") as file_obj:
    for line in file_obj:
        print(line, end="")

Note

  • the size parameter of readline is the maximum number of characters to read before the newline character is found, so some lines may return fewer characters than size.
  • the hint parameter of readlines caps the total number of characters read: lines are read until the total exceeds hint, so an ’extra’ line may be included just to exceed the hint size.

We perform ‘write’ operations mainly with the methods

  • write
  • writelines

The code snippet below implements a dummy version of ‘copy’

# dummy copy
import os

size_of_chunk = 128
source_file = os.path.join("C:/", "temp", "the_zen_of_python.txt")
target_file = os.path.join("C:/", "temp", "zen.txt")
# binary mode so both binary files and text files can be handled
with open(source_file, "rb") as sf_obj:
    with open(target_file, "wb") as tf_obj:
        while content_chunk := sf_obj.read(size_of_chunk):
            print(">", end="")
            tf_obj.write(content_chunk)

print("Done")

Read and Write with pathlib

pathlib provides OO-style read/write operations. It encapsulates actions like ‘open’ and ‘close’, so we don’t need to perform them ourselves. The related methods are

  • pathlib.Path.write_bytes
  • pathlib.Path.write_text
  • pathlib.Path.read_bytes
  • pathlib.Path.read_text
# dummy copy via pathlib's OO style
import pathlib

source_file = pathlib.Path() / "C:/" / "temp" / "the_zen_of_python.txt"
target_file = pathlib.Path() / "C:/" / "temp" / "zen.txt"

target_file.write_bytes(source_file.read_bytes())
print("Done using pathlib")

read_bytes and read_text don’t provide a parameter to specify a chunk size, and those functions read the entire file into memory. If memory efficiency is important, use the Path object’s open method to get a file object and then work in the classic open style

import pathlib

chunk_size = 128
source_file = pathlib.Path() / "C:/" / "temp" / "the_zen_of_python.txt"
target_file = pathlib.Path() / "C:/" / "temp" / "zen.txt"

with source_file.open(mode="rb") as sf_obj:
    with target_file.open(mode="wb") as tf_obj:
        while chunk := sf_obj.read(chunk_size):
            tf_obj.write(chunk)
            
print("Done!")

File as Standard Out

A file can be set as stdout, so that the print function writes its output to the file instead of to the terminal.

import sys

with open("c:/temp/output.txt", mode="wt") as of_obj:
    sys.stdout = of_obj
    print("Hello")
    print("World")
    # reset stdout back
    sys.stdout = sys.__stdout__
    print("Hi")

As an alternative to reassigning sys.stdout, the file parameter of print can be pointed at the file object, as sketched below.
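
A minimal sketch, reusing the output path from the example above:

with open("c:/temp/output.txt", mode="wt") as of_obj:
    print("Hello", file=of_obj)  # written to the file
    print("World", file=of_obj)
    print("Hi")                  # still goes to the terminal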
