-
Notifications
You must be signed in to change notification settings - Fork 1
Full script for docx extraction #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| syms = ['M', 'CM', 'D', 'CD', 'C', 'XC', 'L', 'XL', 'X', 'IX', 'V', 'IV', 'I'] | ||
| roman = '' | ||
| i = 0 | ||
| while n > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will get stuck in an infinite loop if we pass 0, which we can't exclude from happening.
Should rewrite without while-loop, something like this (not tested):
for v, s in zip(val, syms):
count = n // v
roman += s * count
n -= v * count
|
|
||
| def get_style_num_map(styles_root, numId_to_abstract, abstract_to_levels, num_root): | ||
| style_num_map = {} | ||
| for style in styles_root.findall('w:style', NS): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there are two sources for the style_num_map, the non-abstract and the abstract ones, each being handled by a different for-loop.
Are they independent from one another? Does it makes sense to put them in the same map?
Also, intuitively, I would create two separate functions (or sub-functions, depending on whether we want them in the same map or not)
| return style_num_map | ||
|
|
||
| def extract_paragraph_text(para): | ||
| return ''.join([node.text or '' for node in para.findall('.//w:t', NS)]).strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might not work for paragraphs that contain track changes (w:ins and w:del), to check.
In that case, we probably want to pass a parameter to decide whether we want the version pre or post changes, with post changes by default I guess.
| continue | ||
| style_id = style.attrib.get(f'{{{NS["w"]}}}styleId') | ||
| pPr = style.find('w:pPr', NS) | ||
| if pPr is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be if pPr: for a more Pythonic way to write it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xml.etree.ElementTree: Testing the truth value of an Element is deprecated. In a future release it will always return True. Prefer explicit len(elem) or elem is not None tests instead. zipimport.zipimporter.load_module() is deprecated: use exec_module() instead.
| return style_num_map | ||
|
|
||
| def extract_paragraph_text(para): | ||
| return ''.join([node.text or '' for node in para.findall('.//w:t', NS)]).strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, you can remove the [ ... ] to make it a generator comprehension and avoid creating a list in memory.
| return ''.join([node.text or '' for node in para.findall('.//w:t', NS)]).strip() | |
| return ''.join(node.text or '' for node in para.findall('.//w:t', NS)).strip() |
| return ''.join([node.text or '' for node in para.findall('.//w:t', NS)]).strip() | ||
|
|
||
| def get_list_number(para, style_num_map, counters, numId_to_abstract, abstract_to_levels): | ||
| pPr = para.find('w:pPr', NS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good argument to integrate that code into the parsing done by python-docx would be that it will parse such info only once, such that we can simply do para.pPr afterwards whenever we need it, and avoid losing performance running .find(...) each time.
| if counters[numId][ilvl] == 0: | ||
| counters[numId][ilvl] = level_def.get('start', 1) | ||
| else: | ||
| counters[numId][ilvl] += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the function is called get_list_number, I would expect it to return something but not modify its parameters, like counters here.
I would expect the counters mapping to be already constructed already. Here this looks like it create inconsistent numbering if we call get_list_number twice for the same paragraph.
|
|
||
| lvlText = level_def.get('lvlText', '') | ||
| fmt_values = {} | ||
| for level in range(9): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid any confusion regarding to fact that "level" starts at 1 and not 0, I would initialize counters as defaultdict(lambda: defaultdict(lambda: 1)) and initialize level here at 1:
for level in range(1, 10):
and remove the + 1s afterwards.
Maybe applies to ilvl as well?
And by the way, if 10 the maximum level that can possibly be defined, per the specs?
| for level in range(9): | ||
| if f'%{level + 1}' in lvlText: | ||
| if counters[numId][level] == 0: | ||
| level_start = abstract_to_levels[abstractId].get(level, {}).get('start', 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it is a bit overkill, but we could initialize abstract_to_levels sub-dict with {'start': 1} by default, such that we don't need the double .get
| numbering_str = get_list_number(para, style_num_map, counters, numId_to_abstract, abstract_to_levels) | ||
| text = extract_paragraph_text(para) | ||
| if text: | ||
| lines.append(f'{numbering_str}{text}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a space in between? Or maybe f'{numbering_str} {text.lstrip()}' to have one space at most.
No description provided.