Get Help
>>> help(str)
Help on class str in module builtins:
>>> help(str.find)
Help on method_descriptor:
find(...)
S.find(sub[, start[, end]]) -> int
Return the lowest index in S where substring sub is found,
such that sub is contained within S[start:end]. Optional
arguments start and end are interpreted as in slice notation.
Return -1 on failure.
Collections
Lists, Tuples, Sets and Dictionaries
Summary
- Lists: containers to hold multiple elements in order
- Tuples: similar to lists, but immutable
- Sets: containers to hold multiple elements when membership matters more than order or position
- Dictionaries: key-value pairs
List highlights
# a list can hold elements of different types
>>> x = [1, 2, 3, "abc", [4, 5]]
# slicing is a widely used operation
# [start index: end index: step]
>>> x[3:1:-1]
['abc', 3]
# perform in place modification with slicing
>>> x = [1, 2, 3, "abc", [4, 5]]
>>> x[3:] = [4]
>>> x
[1, 2, 3, 4]
# in place 'filtering'
>>> x[:] = [e for e in x if e % 2 == 0]
>>> x
[2, 4]
# in-place sort vs. returning a sorted list
# in-place sort
>>> countries = ["China", "USA", "Australia"]
>>> countries.sort(key=lambda x: len(x))
>>> countries
['USA', 'China', 'Australia']
# sorted built-in function returns a sorted list
>>> countries = ["China", "USA", "Australia"]
>>> sorted(countries, key=lambda x: len(x))
['USA', 'China', 'Australia']
# shallow copy vs. deep copy
>>> l1 = [["x"], "y"]
# shallow copy via slicing
>>> l1_sc = l1[:]
# deep copy
>>> import copy
>>> l1_dc = copy.deepcopy(l1)
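To see why the distinction matters, mutate the nested list and compare the two copies; a minimal sketch:

```python
import copy

l1 = [["x"], "y"]
l1_sc = l1[:]              # shallow copy: the inner list is shared
l1_dc = copy.deepcopy(l1)  # deep copy: the inner list is duplicated

l1[0].append("z")          # mutate the nested list through the original
print(l1_sc)  # [['x', 'z'], 'y'] -- the shallow copy sees the change
print(l1_dc)  # [['x'], 'y']      -- the deep copy does not
```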
Tuple highlights
# `,` is needed for single element tuple
>>> type((1,))
<class 'tuple'>
>>> type((1))
<class 'int'>
# a tuple is always immutable, but NOT necessarily hashable
>>> x = (1,2,[3])
>>> type(x)
<class 'tuple'>
# tuple itself is immutable, but its content may be mutable
>>> x[2].extend([4,5])
>>> x
(1, 2, [3, 4, 5])
# swap variable values with tuple and packing/unpacking
>>> x = 5
>>> y = 23
>>> x,y = y,x # y,x is packed into a tuple and then unpacked for assignment
>>> x
23
>>> y
5
Set highlights
# items in a set must be both immutable and hashable
>>> set((1,2,3))
{1, 2, 3}
>>> set((1,2,[3]))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
# duplicate items are removed when adding to set
>>> s = {1,2,3,4,5,2,3}
>>> s
{1, 2, 3, 4, 5}
>>> s.add(5)
>>> s
{1, 2, 3, 4, 5}
# a set itself is mutable and NOT hashable
# to put a set inside another set, use frozenset
>>> {1,2,3,{4,5}}
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'set'
>>> {1,2,3,frozenset({4,5})}
{frozenset({4, 5}), 1, 2, 3}
Dictionary highlights
# widely used `items` function
>>> for k,v in {"China": 5, "USA": 3}.items():
...     print(f"{k} --> {v}")
...
China --> 5
USA --> 3
# to delete an entry, use `del`
>>> d = {"China":5, "USA":3}
>>> del d["USA"]
>>> d
{'China': 5}
# provide default value when the key does NOT exist in the dict
# `dict.get(key, dflt_val)`
# `dict.setdefault(key, dflt_val)`
>>> d = {"China":5, "USA":3}
>>> d.get("Japan", 5)
5
>>> d.setdefault("Korea", 5)
5
>>> d["Korea"]
5
Dictionaries can be used as caches to avoid recalculation
cal_cache = {}

def calc(param):
    if param not in cal_cache:
        # calculate and then store the result into the cache
        result = calculate(param)
        cal_cache[param] = result
    return cal_cache[param]
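The standard library's `functools.lru_cache` implements this caching pattern as a decorator; a minimal sketch using Fibonacci as a stand-in for the expensive calculation:

```python
import functools

@functools.lru_cache(maxsize=None)
def fib(n):
    # results are cached per argument, so each n is computed only once
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))  # 832040
```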
Comprehension
Don’t loop if a comprehension can do it cleaner.
# list comprehension
>>> [e*e for e in [1,2,3]]
[1, 4, 9]
# set comprehension
>>> {e*e for e in {1,2,3}}
{1, 4, 9}
# dict comprehension
>>> {k.upper() : v*2 for k, v in {"a":1, "b":2}.items()}
{'A': 2, 'B': 4}
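Comprehensions also accept an optional `if` clause for filtering and allow nested loops; two small sketches:

```python
# an `if` clause filters elements inside a comprehension
evens_squared = [e * e for e in range(10) if e % 2 == 0]
print(evens_squared)  # [0, 4, 16, 36, 64]

# nested loops are allowed too, e.g. flattening a list of lists
flat = [e for row in [[1, 2], [3, 4]] for e in row]
print(flat)  # [1, 2, 3, 4]
```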
Strings
Strings can be treated as sequences of chars, so operations like slicing can be performed on strings.
>>> "Hello"[-1::-1]
'olleH'
Numeric and Unicode escape sequences can be used to represent strings.
>>> "\x6D"
'm'
>>> "\u2713"
'✓'
>>> '\u4F60\u597D'
'你好'
Strings are immutable, so string methods return new strings, although they may look like they update the string contents in place.
>>> "hello, world".title()
'Hello, World'
>>> "C++++".replace("++","+")
'C++'
The string module defines some useful constants.
>>> import string
>>> string.digits
'0123456789'
>>> string.hexdigits
'0123456789abcdefABCDEF'
>>> string.whitespace
' \t\n\r\x0b\x0c'
>>> string.ascii_lowercase
'abcdefghijklmnopqrstuvwxyz'
Formal string representation vs. Informal string representation
- `repr`: the formal string representation of a Python object. The returned string can often be used to rebuild the original object, much like serialization/deserialization. It's great for debugging programs.
- `str`: the informal string representation of a Python object. It's intended to be read by humans.
- `str` applied to an object falls back to `repr` when the object doesn't define its own `__str__`.
>>> repr([1,2,3])
'[1, 2, 3]'
>>> str([1,2,3])
'[1, 2, 3]'
String interpolation is available since version 3.6. It’s called f-string.
>>> value = 523
>>> f"The value: {value}"
'The value: 523'
# function can be called
>>> lang = "go"
>>> f"The next one: {lang.upper()}"
'The next one: GO'
Bytes
String vs. Bytes
- A string object is an immutable sequence of Unicode characters.
- A bytes object is a sequence of integers with values from 0 to 255, mainly for dealing with binary data.
Two confusing items
- Unicode: a set of characters, each assigned a code point
- UTF-8: an encoding standard used to represent Unicode as bytes. With different encodings, the same Unicode characters are represented by different byte values.
>>> c = "\u2713"
>>> c
'✓'
# try different encoding
>>> c.encode(encoding='utf-16')
b"\xff\xfe\x13'"
>>> c.encode(encoding='utf-8')
b'\xe2\x9c\x93'
# decode the encoded value back to a string; by default utf-8 is the encoding/decoding standard
>>> b'\xe2\x9c\x93'.decode()
'✓'
Control Flow
The 'ladder' structure is like below

if condition1:
    body1
elif condition2:
    body2
elif condition3:
    body3
...
elif condition(n-1):
    body(n-1)
else:
    body(n)
pass can be used if an empty body of if or else is needed.

if cond:
    pass
else:
    do_something_else()  # placeholder for the real work
A dictionary can be used to flatten the 'ladder' structure.

def take_action_a():
    print("doing something")

def take_action_b():
    print("doing something else")

def take_action_c():
    print("doing another thing")

func_dict = {'a': take_action_a,
             'b': take_action_b,
             'c': take_action_c}

# populate the desired function key; here 'a' is simply assigned for demo purposes
desired_func_key = 'a'
func_dict[desired_func_key]()
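When the key might be missing, `dict.get` can supply a fallback callable instead of raising a `KeyError`; a small sketch with hypothetical action names:

```python
def take_action_a():
    print("action a")

def fallback_action():
    print("unknown key, doing nothing special")

func_dict = {'a': take_action_a}

# `dict.get` supplies a fallback callable when the key is absent
func_dict.get('z', fallback_action)()  # runs the fallback
```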
for Loop
The for loop is different from the one in 'C family' programming languages. In Python, for iterates over the values produced by any iterable object, so it's more like a for-each iterator than a counting loop structure.
>>> for elt in [1,2,3,4,5]:
...     if elt % 2 == 0:
...         print(elt)
...
2
4
Unpacking is supported by for.
>>> for idx, val in enumerate(["A", "B", "C"]):
...     print(f"{idx}: {val}")
...
0: A
1: B
2: C
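Unpacking also pairs nicely with `zip` to iterate several sequences in lockstep; a small sketch with made-up data:

```python
names = ["Ada", "Grace"]
years = [1815, 1906]

# `zip` pairs up elements; `for` unpacks each pair
for name, year in zip(names, years):
    print(f"{name}: {year}")
# Ada: 1815
# Grace: 1906
```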
range, Generator and Memory Usage
When dealing with lists holding large numbers of elements, we may run into memory usage issues. Compare the memory consumption of a list and a range.
>>> import sys
>>> sys.getsizeof(list(range(1000000)))
8000056
>>> sys.getsizeof(range(1000000))
48
So using a range or a generator can reduce the strain on memory.
>>> x = list(range(1_000_000))
# using generator expression, we don't have to 'duplicate' the size of `x`
>>> g = (elt * elt for elt in x)
>>> import sys
>>> sys.getsizeof(g)
104
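To compare concretely, here are the same million squares as a list and as a generator (exact sizes vary by Python version and platform):

```python
import sys

# the list materializes every element up front
squares_list = [e * e for e in range(1_000_000)]
# the generator produces elements one at a time
squares_gen = (e * e for e in range(1_000_000))

print(sys.getsizeof(squares_list))  # several megabytes
print(sys.getsizeof(squares_gen))   # a small, fixed size

# both yield the same values when consumed
print(sum(squares_gen) == sum(squares_list))  # True
```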
Boolean Values for Conditions
In Python
- 0 or empty values are False.
- Any other values are True.
Some practical examples
- Values like `0.0` and `0+0j` are False.
- The empty string `""` is False.
- The empty list `[]` is False.
- The empty dictionary `{}` is False.
- The special value `None` is always False.
Some objects, such as file objects and code objects, don't have a sensible definition of 0 or an empty element, so they should NOT be used in a Boolean context.
Some boolean-related operators
- `in` and `not in` to test membership
- `is` and `is not` to test identity
- `and`, `or`, and `not` to combine boolean values
Operators
==/!= vs. is/is not
Equality vs. Identity
- `==` / `!=`: to test equality
- `is` / `is not`: to test identity
>>> l1 = [1,2,3]
>>> l2 = [1,2,3]
>>> l1 == l2
True
>>> l1 is l2
False
and and or Used in Non-Boolean Context
and and or can be used in non-boolean context to ‘pick’ the object.
- `and`: picks the first false object, or the last object
- `or`: picks the first true object, or the last object
>>> "a" and "" and "c"
''
>>> "a" and "b" and "c"
'c'
>>> "a" or "" or "c"
'a'
>>> "" or "" or "c"
'c'
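This picking behavior enables a common default-value idiom, with one caveat worth remembering:

```python
# fall back to a default when a value is 'empty'
user_input = ""
name = user_input or "anonymous"
print(name)  # anonymous

# caveat: any falsy value is replaced, including a legitimate 0
count = 0
effective = count or 10
print(effective)  # 10, even if 0 was intended
```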
Alternative to Ternary Operator ? :
Some programming languages provide the ternary operator ? :, such as the below JavaScript code snippet
name = 1 ? "Yang" : "Yin"
console.log(name)
However, there is NO such ternary operator in Python. Python chooses a more readable style
>>> name = "Yang" if 1 else "Yin"
>>> print(name)
Yang
Functions
The basic function definition is like below
>>> def double(x):
...     return x * 2
...
>>> double(5)
10
# function without parameters
>>> def subroutine():
...     print("This is subroutine")
...
>>> subroutine()
This is subroutine
# function without explicit return
# in this case `None` is returned
>>> def no_explicit_return():
...     print("No explicit_return")
...
>>> r = no_explicit_return()
No explicit_return
>>> r is None
True
Parameters
Three available options for function parameters
- Positional parameters
- Named parameters
- Variable numbers of parameters
Named parameters help remove the ambiguity in some cases
>>> def power(base, exponential):
...     if exponential == 0:
...         return 1
...     else:
...         return base * power(base, exponential-1)
...
# using named parameters, we know it's the cube of 2, not the square of 3
>>> power(base=2, exponential=3)
8
In addition, named parameters help in the default-value case
>>> def greet(message="Hello", name="world"):
...     print(f"{message}, {name}")
...
# use the default value of `message` parameter
>>> greet(name="NYC")
Hello, NYC
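One pitfall related to default values: they are evaluated once at function definition time, so a mutable default is shared across calls. A sketch of the problem and the conventional fix:

```python
def append_bad(item, target=[]):
    # the default list is created once and shared across calls
    target.append(item)
    return target

print(append_bad(1))  # [1]
print(append_bad(2))  # [1, 2] -- the list carried over!

def append_good(item, target=None):
    # the conventional fix: create a fresh list per call
    if target is None:
        target = []
    target.append(item)
    return target

print(append_good(1))  # [1]
print(append_good(2))  # [2]
```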
Variable numbers of parameters allow a function to handle arbitrary numbers of arguments. There is no method overloading in Python like in Java, and variable numbers of parameters can be used to mimic that feature. In addition, the decorator pattern can be implemented with variable numbers of parameters.
def decorate(fn):
    def decorated_fn(*parameters, **key_val_pairs):
        print("Doing decoration tasks...")
        fn(*parameters, **key_val_pairs)
        print("End\n")
    return decorated_fn

def greet(message="Hello", name="world"):
    print(f"{message}, {name}")

decorated_greet = decorate(greet)
decorated_greet("Hi")
decorated_greet(name="NYC")
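Python's `@` syntax applies a decorator at definition time, equivalent to reassigning the name manually; a minimal sketch:

```python
def decorate(fn):
    def decorated_fn(*parameters, **key_val_pairs):
        print("Doing decoration tasks...")
        fn(*parameters, **key_val_pairs)
        print("End\n")
    return decorated_fn

@decorate  # equivalent to: greet = decorate(greet)
def greet(message="Hello", name="world"):
    print(f"{message}, {name}")

greet(name="NYC")
```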
Functions as First-class Citizens
Functions can be assigned to variables, just as other values in Python.
>>> def foo():
...     print("foo function!")
...
>>> fn = foo
>>> fn()
foo function!
Anonymous functions are implemented as lambda expressions.
>>> fn = lambda: print("bar function!")
>>> fn()
bar function!
>>> fn
<function <lambda> at 0x000002E3BDD3EB90>
Higher-order functions are supported natively, since functions are first-class citizens.
# a function can accept functions and return a function
def combine(outer_fn, inner_fn):
    def combined_fn(*parameters, **key_values):
        return outer_fn(inner_fn(*parameters, **key_values))
    return combined_fn

def square(x):
    return x * x

dbl = lambda x: x * 2

double_of_squared = combine(dbl, square)
r = double_of_squared(5)
print(r)
Scope: global and nonlocal
Local variables vs. global variables vs. nonlocal variables
- local variables: variables defined in the function
- global variables: variables defined outside the function
- nonlocal variables: variables defined in the 'enclosing' scope
Compare local variables and global variables.
a = 10

def foo():
    # global a
    a = 20
    print(f"a in foo: {a}")

foo()
print(f"global a: {a}")

# we get below result
# a in foo: 20
# global a: 10

# when `global a` is uncommented, we get below result
# a in foo: 20
# global a: 20
nonlocal refers to the one defined in the enclosing function.
a = 10

def foo():
    a = 20
    def bar():
        nonlocal a
        a = 30
        print(f"a in bar: {a}")
    bar()
    print(f"a in foo: {a}")

foo()
print(f"global a: {a}")

# we get below result
# a in bar: 30
# a in foo: 30
# global a: 10
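`nonlocal` is also what makes stateful closures work; a minimal counter sketch:

```python
def make_counter():
    count = 0
    def increment():
        nonlocal count  # rebind the enclosing variable, not a new local
        count += 1
        return count
    return increment

counter = make_counter()
print(counter())  # 1
print(counter())  # 2
```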
Generator Functions
Besides generator expressions, generator functions also help reduce memory usage.
# generator function
def gen_1_m():
    i = 1
    while i < 1_000_000:
        yield i
        i = i + 1

s = 0
for elt in gen_1_m():
    s = s + elt
print(s)
yield from can be used to delegate the generator to another generator.
g1 = range(1, 500_000)
g2 = range(500_000, 1_000_000)

def gen_1_m():
    yield from g1
    yield from g2

s = 0
for elt in gen_1_m():
    s = s + elt
print(s)
Modules and Scoping Rules
What is a module?
- a file containing Python code, which defines Python functions or objects
- the name of the file defines the name of the module
Why use modules?
- for better organizing source code
- modules help avert the name-clash issue. Suppose two people both define a `greet` function; with modules, the two can coexist as `module_a.greet` and `module_b.greet`.
To use a module, import it first.
# import the built-in `math` module
>>> import math
# check the members of the module
>>> dir(math)
['__doc__', '__loader__', '__name__', '__package__', '__spec__', 'acos', 'acosh', 'asin', 'asinh', 'atan', 'atan2', 'atanh', 'ceil', 'comb', 'copysign', 'cos', 'cosh', 'degrees', 'dist', 'e', 'erf', 'erfc', 'exp', 'expm1', 'fabs', 'factorial', 'floor', 'fmod', 'frexp', 'fsum', 'gamma', 'gcd', 'hypot', 'inf', 'isclose', 'isfinite', 'isinf', 'isnan', 'isqrt', 'lcm', 'ldexp', 'lgamma', 'log', 'log10', 'log1p', 'log2', 'modf', 'nan', 'nextafter', 'perm', 'pi', 'pow', 'prod', 'radians', 'remainder', 'sin', 'sinh', 'sqrt', 'tan', 'tanh', 'tau', 'trunc', 'ulp']
# reference `pi` defined in `math`
>>> math.pi
3.141592653589793
Another import form is from <module> import <member/*>
>>> from math import pi
>>> pi
3.141592653589793
# we can even import all members using `*`
>>> from math import *
>>> gcd
<built-in function gcd>
The Module Search Path
To make module files available for Python to import, put them under any of the path entries defined in sys.path
>>> import sys
>>> sys.path
['', 'C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python310\\python310.zip', 'C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python310\\DLLs', 'C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python310\\lib', 'C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python310', 'C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages']
Note
- The first module file found in the entries is used.
- If no desired module can be found, an `ImportError` exception is raised.
How are the path entries in the sys.path list defined?
- The `sys.path` list is initialized based on the `PYTHONPATH` environment variable if it exists.
- Define a `.pth` file to indicate the path entries, and put the `.pth` file under the directory defined by `sys.prefix`.
Scoping Rules and Namespaces
A namespace maintains the mapping from identifiers to objects. A statement like x = 1 adds x to a namespace and associates x with the value 1.
In Python there are three namespaces
- local: holding local functions and variables
- global: holding module functions and module variables
- built-in: holding built-in functions
When Python needs to ’locate’ the identifier, it follows below sequence
- Check local namespace.
- If the identifier doesn’t exist in local namespace, check global namespace.
- If the identifier doesn’t exist in global namespace, check built-in namespace.
- If the identifier doesn't exist in any of the above, a `NameError` occurs.
When a function call is made, a local namespace is created.
def foo():
    x = 1
    print(f"In foo locals: {locals()}")
    print(f"In foo globals: {globals()}")

y = 2
foo()

# on the global level, locals() is equivalent to globals()
print(locals() == globals())
print(dir(__builtins__))
# executing above code snippet, we get
In foo locals: {'x': 1}
In foo globals: {'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <_frozen_importlib_external.SourceFileLoader object at 0x0000013F1797C700>, '__spec__': None, '__annotations__': {}, '__builtins__': <module 'builtins' (built-in)>, '__file__': 'C:\\sandbox\\PythonLab\\Scripts\\lab.py', '__cached__': None, 'foo': <function foo at 0x0000013F178B3E20>, 'y': 2}
True
['ArithmeticError', 'AssertionError', 'AttributeError', 'BaseException', 'BlockingIOError', 'BrokenPipeError', 'BufferError', 'BytesWarning', 'ChildProcessError', 'ConnectionAbortedError', 'ConnectionError', 'ConnectionRefusedError', 'ConnectionResetError', 'DeprecationWarning', 'EOFError', 'Ellipsis', 'EncodingWarning', 'EnvironmentError', 'Exception', 'False', 'FileExistsError', 'FileNotFoundError', 'FloatingPointError', 'FutureWarning', 'GeneratorExit', 'IOError', 'ImportError', 'ImportWarning', 'IndentationError', 'IndexError', 'InterruptedError', 'IsADirectoryError', 'KeyError', 'KeyboardInterrupt', 'LookupError', 'MemoryError', 'ModuleNotFoundError', 'NameError', 'None', 'NotADirectoryError', 'NotImplemented', 'NotImplementedError', 'OSError', 'OverflowError', 'PendingDeprecationWarning', 'PermissionError', 'ProcessLookupError', 'RecursionError', 'ReferenceError', 'ResourceWarning', 'RuntimeError', 'RuntimeWarning', 'StopAsyncIteration', 'StopIteration', 'SyntaxError', 'SyntaxWarning', 'SystemError', 'SystemExit', 'TabError', 'TimeoutError', 'True', 'TypeError', 'UnboundLocalError', 'UnicodeDecodeError', 'UnicodeEncodeError', 'UnicodeError', 'UnicodeTranslateError', 'UnicodeWarning', 'UserWarning', 'ValueError', 'Warning', 'WindowsError', 'ZeroDivisionError', '__build_class__', '__debug__', '__doc__', '__import__', '__loader__', '__name__', '__package__', '__spec__', 'abs', 'aiter', 'all', 'anext', 'any', 'ascii', 'bin', 'bool', 'breakpoint', 'bytearray', 'bytes', 'callable', 'chr', 'classmethod', 'compile', 'complex', 'copyright', 'credits', 'delattr', 'dict', 'dir', 'divmod', 'enumerate', 'eval', 'exec', 'exit', 'filter', 'float', 'format', 'frozenset', 'getattr', 'globals', 'hasattr', 'hash', 'help', 'hex', 'id', 'input', 'int', 'isinstance', 'issubclass', 'iter', 'len', 'license', 'list', 'locals', 'map', 'max', 'memoryview', 'min', 'next', 'object', 'oct', 'open', 'ord', 'pow', 'print', 'property', 'quit', 'range', 'repr', 'reversed', 'round', 'set', 
'setattr', 'slice', 'sorted', 'staticmethod', 'str', 'sum', 'super', 'tuple', 'type', 'vars', 'zip']
Interaction between Python Program and System
Combine Script and Module
A Python program can be treated as a script or a module depending on the execution context. The structure below does the trick.
if __name__ == '__main__':
    main()
else:
    # module-specific initialization code if needed
    pass
When the Python file is executed as a script, its __name__ is set to '__main__'.
Commandline Arguments
The arguments passed from commandline can be retrieved via sys.argv.
import sys

def main():
    print(sys.argv)

main()
sys.argv is a list
- The first element is the name of the script file.
- The following elements are the arguments passed from commandline.
PS C:\sandbox\PythonLab\TempLab> python .\my_script.py Hello World "Test Script"
['.\\my_script.py', 'Hello', 'World', 'Test Script']
# omit `.\` to invoke the script file
PS C:\sandbox\PythonLab\TempLab> python my_script.py Hello World "Test Script"
['my_script.py', 'Hello', 'World', 'Test Script']
Use argparse module if more advanced features are needed to handle commandline arguments.
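A minimal argparse sketch (the argument names here are invented for the demo, and `parse_args` is given an explicit list so it doesn't depend on the real command line):

```python
import argparse

parser = argparse.ArgumentParser(description="argparse demo")
parser.add_argument("name", help="a positional argument")
parser.add_argument("--repeat", type=int, default=1, help="an optional flag")

# parse a hand-supplied list instead of the real sys.argv, for demonstration
args = parser.parse_args(["World", "--repeat", "2"])
for _ in range(args.repeat):
    print(f"Hello, {args.name}")
```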
Filesystem Interaction
File Paths
Path-related modules
- `os.path`: the traditional, imperative-style module.
- `pathlib`: available since Python 3.4, OO style.

os.path provides a useful abstraction layer to ease operations on filesystems. For example, the file path separator differs from OS to OS
- `\` in Windows
- `/` in *nix
Using os.path.sep, we don't have to worry about the difference. As a result, a program using that abstraction layer should
- use `os.path.curdir`
- NOT use `.` directly
Unfortunately, there is no unified concept of a root path. Think about the path types we have in Windows
- `C:\` means the C drive
- `\\myftp\share\` means a UNC root path

As a result, we do NOT have something like `os.path.root` in Python.
To form a path
# form a path with os.path
os.path.join("c:/", "Sandbox", "Temp")
# form a path with pathlib
# note `joinpath` of Path object is an instance method
pathlib.Path().joinpath("c:/", "Sandbox", "Temp")
pathlib.Path() / "c:/"/ "Sandbox"/ "Temp"
pathlib.Path("c:/") / "Sandbox"/ "Temp"
Filesystem Operations
Filesystem operations are performed via the os module. Don't confuse it with the sys module. Think of `sys` as 'Python System'.
# change directory
os.chdir("My Target Dir")
# print current working directory
os.getcwd()
# list items in the directory
# Note: in Windows, we may encounter PermissionError if the dir is read-only
os.listdir(os.path.curdir)
# get file/dir info
os.path.exists(path_as_arg)
os.path.isfile(path_as_arg)
os.path.isdir(path_as_arg)
os.path.getsize(path_as_arg)
os.path.getatime(path_as_arg)
# rename a file/dir
os.rename("original", "target")
# remove a file
# `remove` function cannot remove a directory
os.remove("file_to_be_removed")
# `rmdir` can remove an empty directory
os.rmdir("empty_dir_to_be_removed")
# create a directory
os.mkdir("dir_name")
os.makedirs("auto_create_intermediate_dirs")  # creates intermediate directories automatically
If OO style is preferred, use the pathlib module. With pathlib we create objects representing different paths, so we don't perform operations like pathlib_obj.chdir("Target Dir")
# create the obj representing the current dir
curr_dir = pathlib.Path()
# create the obj representing the specified path
root_dir = pathlib.Path("/")
# list the items in the directory
for fs_item in curr_dir.iterdir():
    print(fs_item)  # fs_item is a Path object as well
# print current working directory
# below two expressions return the same value
# note: the current working directory is determined by where the Python program was started (and by any later directory switch), not by the Path object itself
curr_dir.cwd()
root_dir.cwd()
# get file/dir info
path_obj.exists()
path_obj.is_file()
path_obj.is_dir()
path_obj.stat()
# rename a file or directory
path_obj.rename("new_name")
# remove a file
path_obj.unlink()
# remove an empty directory
path_obj.rmdir()
# create a directory
path_obj.mkdir() # requires intermediate directories exist
path_obj.mkdir(parents=True) # intermediate directories will be created automatically
Utilities for Filesystem Operation
os.scandir provides an easy approach to get metadata of filesystem entries under a directory.
# use a context manager to ensure the file descriptor is released
# regardless of whether the iterator is fully iterated
with os.scandir(os.curdir) as my_dir:
    for fs_entry in my_dir:
        print(f"{fs_entry.name}: {fs_entry.stat()}")
glob.glob provides the globbing functionality.
import glob
import os

os.chdir("c:/sandbox/pythonlab/scripts")
py_files = glob.glob("*.py")
for py_file in py_files:
    print(f"Python File: {py_file}")
shutil.rmtree can remove a non-empty directory, and shutil.copytree can recursively copy all the files and subdirectories in a given directory.
import shutil
shutil.rmtree(nonempty_dir_to_be_removed)
shutil.copytree(src, dst)
os.walk(directory, topdown=True, onerror=None, followlinks=False) traverses a directory structure recursively. For each directory visited, it yields three things
- the root path of the directory
- a list of its subdirectories (os.walk will recurse into each subdir)
- a list of its files
for root, subdirs, files in os.walk("Test"):
    for file in files:
        print(f"file name: {file}")
    # remove the backup directory from the recursion
    subdirs[:] = [e for e in subdirs if e != "backup"]
    print(f"Subdir list now is {subdirs}")
Note
- If `topdown` is True or not given, the files in each directory are processed before moving to subdirectories. That means we have a chance to remove some subdirectories, such as `.git/` or `.config/`, from the recursion.
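As a small self-contained sketch, os.walk can be used to total file sizes under a directory (the snippet builds a throwaway tree in a temp directory first):

```python
import os
import tempfile

# build a tiny tree inside a temp directory so the walk is self-contained
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(root, rel), "w") as f:
            f.write("hello")

    # total the sizes of all files under `root`
    total = 0
    for dirpath, subdirs, files in os.walk(root):
        for name in files:
            total += os.path.getsize(os.path.join(dirpath, name))
    print(total)  # 10: two files of 5 bytes each
```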
File I/O
Open and Close Files
The classic open-process-close file operation is like below
file_obj = open("c:/temp/hello.txt")
print(file_obj.readline())
file_obj.close()
print(f"File closed? {file_obj.closed}")
Using context managers, we don't need to explicitly close the file.
with open("c:/temp/hello.txt") as file_obj:
    print(file_obj.readline())
Specify the mode to open the file with
- `r`: read mode, the default mode
- `w`: write mode, data in the file is truncated before the writing operation
- `a`: append mode, new data is appended to the end of the file
- `x`: new file mode, it throws FileExistsError if the file already exists
- `+`: read and write mode
- `t`: text mode, the default mode
- `b`: binary mode, it supports random access
With the above modes combined, we have
- `rt`: read as text
- `w+b`: random access to the file in binary mode, truncating the file first
- `r+b`: random access to the file in binary mode, without truncating the file first
In addition, pay attention to the below options when opening a file
- `encoding`: specify the encoding to open the file with
- `newline`: different operating systems may use different characters as the newline character
Suppose we have a text file containing the below Unicode chars, saved with UTF-8 encoding
✓💓🍁
We can specify the encoding as utf-8 when opening the file
with open("c:/temp/unicode.txt", encoding="utf-8") as file_obj:
    print(file_obj.read(1))
    print(file_obj.read(1))
    print(file_obj.read(1))
Read and Write with TextIOWrapper
In most cases, read, readline and readlines without arguments are good enough to handle file reading. However, there are some exceptional cases, like
- the file is too large
- a line contains too much content
- there are too many lines
Two approaches to tackle the issue
- provide additional arguments to affect the amount of data being read every time
- use iterator to lazily load file contents
# argument to affect the amount of data being read every time
size_to_read = 50
with open("c:/temp/the_zen_of_python.txt", mode="rt") as file_obj:
    while sized_content := file_obj.read(size_to_read):
        print(sized_content, end='')
# treat the file object as an iterator
# `open` returns a file object, which is an iterator
# `isinstance(fo, collections.abc.Iterator)` returns True
with open("c:/temp/the_zen_of_python.txt", mode="rt") as file_obj:
    for line in file_obj:
        print(line, end="")
Note
- The `size` parameter of `readline` indicates the max number of chars to read before encountering the newline character, so we may read less than the size on some lines.
- The `hint` parameter of `readlines` indicates the number of chars to be exceeded while reading lines, so we may read an 'extra' line, just enough to exceed the `hint` size.
We perform the 'write' operation mainly with the functions
- `write`
- `writelines`
Below code snippet implements a dummy version of ‘copy’
# dummy copy
import os
size_of_chunk = 128
source_file = os.path.join("C:/", "temp", "the_zen_of_python.txt")
target_file = os.path.join("C:/", "temp", "zen.txt")
# binary mode so both binary files and text files can be handled
with open(source_file, "rb") as sf_obj:
    with open(target_file, "wb") as tf_obj:
        while content_chunk := sf_obj.read(size_of_chunk):
            print(">", end="")
            tf_obj.write(content_chunk)
print("Done")
Read and Write with pathlib
pathlib provides OO style read/write operations. It encapsulates actions like ‘open’ and ‘close’, so we don’t need to do them by ourselves. Below are the related functions
- `pathlib.Path.write_bytes`
- `pathlib.Path.write_text`
- `pathlib.Path.read_bytes`
- `pathlib.Path.read_text`
# dummy copy via pathlib's OO style
import pathlib
source_file = pathlib.Path() / "C:/" / "temp" / "the_zen_of_python.txt"
target_file = pathlib.Path() / "C:/" / "temp" / "zen.txt"
target_file.write_bytes(source_file.read_bytes())
print("Done using pathlib")
read_bytes and read_text don't provide a parameter to specify a chunk size, and they read the entire file into memory. If memory efficiency is important, use the open method of the Path object to get the file object and then work in the classic open style
import pathlib
chunk_size = 128
source_file = pathlib.Path() / "C:/" / "temp" / "the_zen_of_python.txt"
target_file = pathlib.Path() / "C:/" / "temp" / "zen.txt"
with source_file.open(mode="rb") as sf_obj:
    with target_file.open(mode="wb") as tf_obj:
        while chunk := sf_obj.read(chunk_size):
            tf_obj.write(chunk)
print("Done!")
File as Standard Out
A file can be set as stdout, so that the print function writes content to the file instead of to the terminal.
import sys
with open("c:/temp/output.txt", mode="wt") as of_obj:
    sys.stdout = of_obj
    print("Hello")
    print("World")
    # reset stdout back
    sys.stdout = sys.__stdout__

print("Hi")
As an alternative to setting sys.stdout to a file, we can pass the file parameter to each print call to direct the output to the specified file.
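The standard library also provides contextlib.redirect_stdout, which restores stdout automatically when the block exits; a minimal sketch:

```python
import contextlib
import io

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print("Hello")  # captured by the buffer, not the terminal
print("Hi")          # stdout is already restored here
print(repr(buffer.getvalue()))  # 'Hello\n'
```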