Skip to main content

Stack Exchange Network

Stack Exchange network consists of 183 Q&A communities including Stack Overflow, the largest, most trusted online community for developers to learn, share their knowledge, and build their careers.

Visit Stack Exchange
Asked
Modified 13 days ago
Viewed 1k times
5
\$\begingroup\$

I want to know if the following parser covers every possible cases, and if the code can be cleaner/better organized.

Python 3 code:

def add_element(cache, element):
    if element != '':
        cache[-1].append(element)
    return ''


def to_py(raw_tcl_list):
    out = []
    cache = [out]
    element = ''
    escape = False
    for char in raw_tcl_list:
        if escape:
            if char not in ["\\", "{", "}", "[", "]", "$"]:
                raise ValueError("Incorrect escape character %s" % char)
            element += char
            escape = False
        elif char == "\\":
            escape = True
        elif char in [" ", "\t", "\r", "\n"]:
            element = add_element(cache, element)

        elif char == "{":
            a = []
            cache[-1].append(a)
            cache.append(a)

        elif char == "}":
            element = add_element(cache, element)
            if len(cache) < 2:
                raise ValueError("Close bracket without opening bracket.")

            cache.pop()
        else:
            element += char
    if len(cache) != 1:
        raise ValueError("Mismatched brackets.")
    return out[0]


def to_tcl(py_list):
    if type(py_list) == list:
        out_str = "{ "
        for item in py_list:
            out_str += to_tcl(item)
        out_str += "} "
        return out_str
    else:
        out_str = str(py_list) + " "
        for c in ["\\", "{", "}", "[", "]", "$"]:
            out_str = out_str.replace(c, "\\" + c)
        return out_str


import pprint

x = "{ 12 apple {100} {} {{12 34}} \n {56\n { \\{78 {11 12 11} 10}}}"
y = to_py(x)
z = to_tcl(y)
print("# Original #")
pprint.pprint(x)
print("\n# Translated to Python List #")
pprint.pprint(y)
print("\n# Translat back to Tcl List #")
print(z)

Output:

# Original #
'{ 12 apple {100} {} {{12 34}} \n {56\n { \\{78 {11 12 11} 10}}}'

# Translated to Python List #
['12',
 'apple',
 ['100'],
 [],
 [['12', '34']],
 ['56', ['{78', ['11', '12', '11'], '10']]]

# Translat back to Tcl List #
{ 12 apple { 100 } { } { { 12 34 } } { 56 { \{78 { 11 12 11 } 10 } } } 

List of Tcl special characters: http://wiki.tcl.tk/989#pagetoceaa9fda9

\$\endgroup\$

3 Answers 3

3
\$\begingroup\$

speed

We're using interpreted Python, so it's not like this is going to be a speed demon. If we really cared about speed we would opt for Rust.

That said, if we can process chunks of characters at a time, then we should prefer that. In many input datasets, most characters will be "boring".

import re
from string import ascii_letters, digits

boring = re.compile(rf"[ ,.{ascii_letters}{digits}-]+")

Now we can rely on fast matching within the regex engine on large blocks of boring characters, and operate on much more than a single character at a time. The goal is to execute fewer python interpreted opcodes. We want to make fewer calls to C routines, and pass in larger arguments.

dispatch

Within the to_py() loop we need to test the escape bool. But once that's done we've got four matches on char, four blocks of code, plus a fifth "default" block of code. Consider defining at least four tiny helper functions, which are dict values:

dispatch = {
    "\\": set_escape_flag,
    " ":  assign_element,
    "\t": assign_element,
    ...
    "{":  do_append,
    "}":  do_pop,
}

For most input datasets, the OP code will see a few False booleans, and then default to doing an append:

        else:
            element += char

I am suggesting this alternative:

        fn = dispatch.get(char, default_append_char)
        fn()  # side effects some state variables

or more simply

        dispatch.get(char, default_append_char)()

side effect & return value

Computing a constant value of the empty string is a little odd:

def add_element(cache, element):
    if element != '':
        cache[-1].append(element)
    return ''

Both call sites perform:

            element = add_element(cache, element)

Consider returning the default of None when the function falls off the end, and adding a line to each call site. (Or maybe the helper function could assign to element?)

            add_element(cache, element)
            element = ""

six versus a single operation

Imagine that out_str is "big", more than a kilobyte. Then doing six operations on it is not great.

        for c in ["\\", "{", "}", "[", "]", "$"]:
            out_str = out_str.replace(c, "\\" + c)

Better to ask a C routine to do it all in a single pass.

        xlate = str.maketrans({
            "\\": "\\\\",
            "{": r"\{",
            "}": r"\}",
            "[": r"\[",
            "]": r"\]",
            "$": r"\$",
        })
        return out_str.translate(xlate)

nit: Instead of str(py_list) + " ", prefer a formatted string: f"{py_list} "

nit, typo: "Translat"

instanceof()

    if type(py_list) == list:

In an OOP setting, respect inheritance and prefer the idiomatic

    if isinstance(py_list, list):

Also, it feels weird that def to_tcl(py_list): clearly asked the caller to pass in a list, yet the Public API we're defining also accommodates non-lists. The signature suggests that passing in a non-list should raise a fatal ValueError. A formal parameter that is instead named py_expression would clarify matters.

naming

The documentation burden is low for these variables, since they're not intended to be exported as part of a Public API.

y = to_py(x)
z = to_tcl(y)

Still, it would be more helpful to name them p and t, for python and tcl.

\$\endgroup\$
3
\$\begingroup\$

There are a few ways the to_tcl function can be streamlined.

def to_tcl(py_list):
    if type(py_list) == list:
        out_str = "{ "
        for item in py_list:
            out_str += to_tcl(item)
        out_str += "} "
        return out_str
    else:
        out_str = str(py_list) + " "
        for c in ["\\", "{", "}", "[", "]", "$"]:
            out_str = out_str.replace(c, "\\" + c)
        return out_str

Let's take that list in the first branch, adding to out_str and map to_tcl to each element of py_list, then pass that to str.join.

But now we might as well just use an f-string.

def to_tcl(py_list):
    if isinstance(py_list, list):
        return '{ ' + ''.join(map(to_tcl, py_list)) + '} '
    else:    
        out_str = str(py_list) + " "
        for c in ["\\", "{", "}", "[", "]", "$"]:
            out_str = out_str.replace(c, "\\" + c)
        return out_str

In the else branch, we can use the form of re.sub that accepts a lambda function to avoid the loop by compiling a regular expression that matches any of these tokens.

def to_tcl(py_list):
    if isinstance(py_list, list):
        return '{ ' + ''.join(map(to_tcl, py_list)) + '} '
    else:
        tokens = ["\\", "{", "}", "[", "]", "$"]
        r = re.compile('|'.join(map(re.escape, tokens)))
        return re.sub(r, lambda m: '\\' + m.group()[0], f"{py_list} ")

This has the added benefit that "\\" not being the first element in the list won't affect the results. When we loop over a string and perform repeated replacements as you have done, we always need to be careful that a previous replacement doesn't affect a later one.

\$\endgroup\$
2
\$\begingroup\$

Tools

You could run code development tools to automatically find some style issues with your code.

ruff and pylint recommend changing:

if type(py_list) == list:

to:

if isinstance(py_list, list):

DRY

In the to_tcl function, this line is in both branches of the if/else:

return out_str

It can be factored out to eliminate the repetition:

if type(py_list) == list:
    out_str = "{ "
    for item in py_list:
        out_str += to_tcl(item)
    out_str += "} "
else:
    out_str = str(py_list) + " "
    for c in ["\\", "{", "}", "[", "]", "$"]:
        out_str = out_str.replace(c, "\\" + c)
return out_str

This list is repeated twice:

["\\", "{", "}", "[", "]", "$"]

It can be assigned to a constant.

Simpler

This line:

raise ValueError("Incorrect escape character %s" % char)

can be simplified with a f-string:

raise ValueError(f"Incorrect escape character {char}")

Documentation

The PEP 8 style guide recommends adding docstrings for functions.

def add_element(cache, element):
    """
    Describe what this function does
    Describe what the input data types are
    Describe what the return data type is
    """

For the other functions, also describe how they do what they do (using recursion, etc.).

\$\endgroup\$

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.

Morty Proxy This is a proxified and sanitized view of the page, visit original site.