#Stripping metadata, headings and categories out of dump parsed to .txt

smoky tinsel
#

Stripping metadata, headings and categories out of dump parsed to .txt

quaint hornet
#

@smoky tinsel Could just be a problem with the regex. Maybe try pasting the xml and the regex into a tester like this and tweak it until it does what you want: https://regex101.com/

smoky tinsel
#

It is a .txt file

#

Well, the result is

quaint hornet
#

@smoky tinsel ok well first of all, you need to escape characters. put a backslash before all the forward slashes

#

having such a long regex line is confusing. maybe make an array of regex objects which each match to one thing, and loop through the array to see if any match

smoky tinsel
quaint hornet
#
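import re

# One compiled pattern per heading to drop; add as many expressions as needed.
regexList = [
    re.compile(r"Genetic Age: 3E 2950-3E 3000"),
    re.compile(r"Awakening Age: 3E 3000-3E 3415"),
]

# File names are the ones used later in this thread.
with open("taereltxt.txt", "r") as input_file, open("output.txt", "w") as output_file:
    for line in input_file:
        # search() finds the text anywhere in the line; match() would only
        # test the start, where the "==" of a heading sits.
        if not any(regex.search(line) for regex in regexList):
            output_file.write(line)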
#

i havent actually tried this code, but thats what im thinking

#

to make it easier to understand what the regex is doing

smoky tinsel
#

To strip out the metadata with mwparserfromhell @quaint hornet

quaint hornet
#

i think there are better ways to do this. probably an xml parser for python
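
A minimal sketch of the XML-parser idea, using only the standard library's `xml.etree.ElementTree`. The miniature dump, the namespace URL in it, and the helper names `local_name` and `page_texts` are assumptions for illustration; a real MediaWiki export has more elements, but each page's wikitext still sits in a `<text>` element under `<page>`.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature dump; a real export carries many more elements.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>History</title>
    <revision><text>==History==
Some article text.</text></revision>
  </page>
  <page>
    <title>Culture</title>
    <revision><text>==Culture==
More article text.</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def page_texts(xml_string):
    """Return a (title, wikitext) tuple for every <page> in the dump."""
    root = ET.fromstring(xml_string)
    pages = []
    for page in root.iter():
        if local_name(page.tag) != "page":
            continue
        title, text = "", ""
        for node in page.iter():
            name = local_name(node.tag)
            if name == "title":
                title = node.text or ""
            elif name == "text":
                text = node.text or ""
        pages.append((title, text))
    return pages


pages = page_texts(SAMPLE_DUMP)
```

Looping over every `<page>` this way keeps the XML metadata out of the output and covers all pages, not only the first one.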

quaint hornet
# smoky tinsel Why not mwparserfromhell?

as far as i know, thats for taking wikitext and turning it into plain text. it doesnt have functions for removing metadata from an xml file

#

if you need help with doing all that, ill have to come back to it at a later time

smoky tinsel
#

This code works, but only strips one page @quaint hornet

smoky tinsel
#

got working code, thank you for your help

quaint hornet
#

Need to escape the braces: \{\{.*?\}\}

smoky tinsel
quaint hornet
#

What exactly are you trying to do?

smoky tinsel
quaint hornet
#

I dont know why that wouldnt work. Whats an example of text that has a template you want to remove?

smoky tinsel
#

<strong>MediaWiki has been installed.</strong>

Consult the [https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents User's Guide] for information on using the wiki software.

== Getting started ==

==History==

The A'voreld clan is a clan in the kin'toni'ati race. Created by kin'toni that ran from the Shattering that happened from 4E 100 to 4E 150, for they did not want to get involved in the wars that were spanning the world of Terael at the time. One reason for the Shattering was that a gigantic volcanic eruption in 4E 165 had resulted in the coldest winters in recorded history. Many of the planet's volcanoes began to erupt during that cold winter and did for over 10 years, blocking out the sun and killing a lot of plant life. This meant that other animals did not get food and could not live.


quaint hornet
#

Isnt your code reading it line by line?

#

It wont match a template that spans multiple lines
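
A small sketch of why a line-by-line loop misses such templates. The sample text and the helper `strip_per_line` are made up for illustration: applied to one line at a time, the regex never sees both the opening `{{` and the closing `}}` of a template that spans several lines.

```python
import re

template_regex = re.compile(r"\{\{.*?\}\}")

# Hypothetical sample: one single-line template and one that spans lines.
sample = "Intro {{stub}} text\n{{Infobox\n| name = Example\n}}\nBody text\n"


def strip_per_line(text):
    """Apply the regex to each line separately, as a line-by-line loop would."""
    return "".join(
        template_regex.sub("", line) for line in text.splitlines(keepends=True)
    )


per_line = strip_per_line(sample)
# The single-line template is removed; the multi-line one survives untouched.
```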

smoky tinsel
#

How can I make it match the multiple lines?

quaint hornet
#

I think instead of going through each line and outputting what doesnt match the regex, you should just grab the entire text into a variable, then use regex to replace those matches with a blank string

#

And do that for each thing you want to remove
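
One way that suggestion could look, as a sketch only: the whole text is held in one string and each pattern is substituted with an empty string in turn. The two patterns and the name `strip_markup` are assumptions, not the code that was actually shared in this thread.

```python
import re

# Hypothetical patterns, one per kind of markup to remove.
PATTERNS = [
    # Templates, including ones that span several lines (DOTALL lets '.'
    # match newlines).
    re.compile(r"\{\{.*?\}\}", re.DOTALL),
    # Whole heading lines such as "==History==", including the line break.
    re.compile(r"^==+[^=\n]+==+[ \t]*\n?", re.MULTILINE),
]


def strip_markup(text):
    """Remove every match of every pattern from the full text."""
    for pattern in PATTERNS:
        text = pattern.sub("", text)
    return text


sample = "==History==\nThe clan {{cite\n|x}} fled.\n"
cleaned = strip_markup(sample)
```

Adding another kind of markup to strip then means appending one more compiled pattern to the list.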

smoky tinsel
#

Code

#

@quaint hornet

quaint hornet
#

Oh i think .* doesnt match newlines. Also you still havent escaped the braces. Try that first

smoky tinsel
#

What could be added to match newlines?

quaint hornet
#

What did you actually change? It looks the same

smoky tinsel
#

Escaping the braces does not work

quaint hornet
#

[\s\S]*?
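
A short comparison of the options, on a made-up sample string. By default `.` stops at a newline; `[\s\S]` matches any character, and the `re.DOTALL` flag makes `.` do the same. Either only helps when the regex is applied to text that actually contains the newlines, not to one line at a time.

```python
import re

text = "before {{Infobox\n| name = Example\n}} after"

# '.' does not cross newlines by default, so nothing is removed here.
dot_default = re.sub(r"\{\{.*?\}\}", "", text)

# [\s\S] matches any character, newlines included.
char_class = re.sub(r"\{\{[\s\S]*?\}\}", "", text)

# re.DOTALL changes '.' so that it matches newlines too.
dot_all = re.sub(r"\{\{.*?\}\}", "", text, flags=re.DOTALL)
```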

#

Youre still doing it line by line

smoky tinsel
#

tried [\s\S]*? no change

quaint hornet
#

Just do output = template_regex.sub("", input_file)

#

To use the whole file

#

Oh

#

Well i can help fix it later today

smoky tinsel
quaint hornet
#

@smoky tinsel can you put the code you're using on pastebin? discord removes backslashes

smoky tinsel
#

@quaint hornet

#

This is to strip out page headers, categories, templates and so on

quaint hornet
#

@smoky tinsel

smoky tinsel
#

Ok

#

Traceback (most recent call last):
  File "/home/demon/Desktop/WikiToolsDoNotDelete/HeadingRemoveTestbed.py", line 11, in <module>
    input_file = re.sub(pattern, "", input_file)
  File "/usr/lib/python3.10/re.py", line 209, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object

@quaint hornet

quaint hornet
#

um ok i edited it, try the new version

#

@smoky tinsel

smoky tinsel
# quaint hornet @smoky tinsel

Traceback (most recent call last):
  File "/home/demon/Desktop/WikiToolsDoNotDelete/HeadingRemoveTestbed.py", line 14, in <module>
    input_file = pattern_regex.sub("", input_file)
TypeError: expected string or bytes-like object

quaint hornet
#

hmm

#

@smoky tinsel try now

smoky tinsel
quaint hornet
#

what happens if you just run this:

with open("taereltxt.txt", "r") as input_file, open("output.txt", "w") as output_file:
    print(str(input_file))
smoky tinsel
#

<_io.TextIOWrapper name='taereltxt.txt' mode='r' encoding='UTF-8'>
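
That output shows `input_file` is a file object, not the file's text, which is what the `TypeError` above complains about. A minimal sketch of the difference follows; the temporary file, its contents, and the pattern are assumptions so that the example can run on its own.

```python
import os
import re
import tempfile

pattern_regex = re.compile(r"\{\{.*?\}\}", re.DOTALL)

# Write a small hypothetical input file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Keep this {{drop\nthis}} and this\n")

with open(path, "r", encoding="utf-8") as input_file:
    try:
        # Passing the file object itself reproduces the TypeError.
        pattern_regex.sub("", input_file)
        error = None
    except TypeError as exc:
        error = str(exc)
    # read() returns the whole file as one string, which sub() accepts.
    text = input_file.read()

cleaned = pattern_regex.sub("", text)
```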

quaint hornet
#

oh

quaint hornet
#

yeah

smoky tinsel
#

@quaint hornet This regex needs to work as well

pattern = r"^(==History==|==Stone Age: Before 1E 0==|==Copper Age: 1E 1-1E 2200==|==Bronze Age: 1E 2200-1E 4400==|==Iron Age: 2E 0-2E 700==|==Ancient Age: 2E 700-2200==|==Middle Ages: 3E 0-2050==|==Early Modern Age: 2050-3E 2600==|==Industrial Age: 3E 2600-3E 2700==|==Machine Age: 3E 2700-3E 2800==|==Atomic Age: 3E 2800-3E 2850==|==Space Age: 2E 2850-2E 2900==|==Information Age: 3E 2850-3E 2900==|==Genetic Age: 3E 2950-3E 3000==|Awakening Age: 3E 3000-3E 3415==|==Shattering Age: 4E 0 - 4E 250==|==History|Geography|Plants|Animals|Biology==|==Psychology==|==Culture==|==Government==|==Military==|==Religion==|==Miscellany==|==History (A'voreld kin'toni Clan)==|==Biology (A'voreld kin'toni Clan)==|==Psychology (A'voreld kin'toni Clan)==|==Culture (A'voreld kin'toni Clan)==|==Government (A'voreld kin'toni Clan)==|==Military (A'voreld kin'toni Clan)==|==Religion (A'voreld kin'toni Clan)==|==Miscellany (A'voreld kin'toni Clan)==|\sTaerel:.?|\sCategory:.?)\n"

quaint hornet
#

@smoky tinsel ok try again

quaint hornet
#

i can make something thats easier for you to add new things that need to be removed. but it'll have to be tomorrow probably, i have to go to work in like ten minutes

smoky tinsel
quaint hornet
#

remove the \n, thats not needed

#

and the parentheses around it all

smoky tinsel
#

I just want to fix the category removal stuff

#

The rest works

quaint hornet
#

then just add new things by adding for example |\[\[Category\]\] to the end of it

#

pattern = r"==History==|==Stone Age: Before 1E 0==|==Copper Age: 1E 1-1E 2200==|==Bronze Age: 1E 2200-1E 4400==|==Iron Age: 2E 0-2E 700==|==Ancient Age: 2E 700-2200==|==Middle Ages: 3E 0-2050==|==Early Modern Age: 2050-3E 2600==|==Industrial Age: 3E 2600-3E 2700==|==Machine Age: 3E 2700-3E 2800==|==Atomic Age: 3E 2800-3E 2850==|==Space Age: 2E 2850-2E 2900==|==Information Age: 3E 2850-3E 2900==|==Genetic Age: 3E 2950-3E 3000==|Awakening Age: 3E 3000-3E 3415==|==Shattering Age: 4E 0 - 4E 250==|==History|Geography|Plants|Animals|Biology==|==Psychology==|==Culture==|==Government==|==Military==|==Religion==|==Miscellany==|==History (A'voreld kin'toni Clan)==|==Biology (A'voreld kin'toni Clan)==|==Psychology (A'voreld kin'toni Clan)==|==Culture (A'voreld kin'toni Clan)==|==Government (A'voreld kin'toni Clan)==|==Military (A'voreld kin'toni Clan)==|==Religion (A'voreld kin'toni Clan)==|==Miscellany (A'voreld kin'toni Clan)==|\sTaerel:.?|\sCategory:.?|\[\[Category\]\]"

#

like that
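
A hedged alternative to maintaining one long hand-written alternation: keep the headings in a list and build the pattern with `re.escape`. This matters for headings that contain parentheses, because an unescaped `(` and `)` form a regex group instead of matching the literal characters. The list below holds only a few headings from the pattern above, and the names `HEADINGS` and `heading_regex` are made up for this sketch.

```python
import re

# A few headings taken from the pattern in this thread; extend as needed.
HEADINGS = [
    "==History==",
    "==Genetic Age: 3E 2950-3E 3000==",
    "==History (A'voreld kin'toni Clan)==",
]

# re.escape turns '(' and ')' into literal characters. The pattern matches a
# whole heading line, including its trailing line break.
heading_regex = re.compile(
    r"^(?:" + "|".join(re.escape(h) for h in HEADINGS) + r")[ \t]*$\n?",
    re.MULTILINE,
)

sample = (
    "==History==\nText one.\n"
    "==History (A'voreld kin'toni Clan)==\nText two.\n"
)
cleaned = heading_regex.sub("", sample)

# Without escaping, the parentheses act as a group and the heading is missed.
unescaped_hit = re.search("==History (A'voreld kin'toni Clan)==", sample)
```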

smoky tinsel
#

Category names and the content in [[Taerel: ]] vary; the whole [[Category:X]] line and every [[Taerel:X]] need removal

quaint hornet
#

or |\[\[.*?\]\]

#

that will remove everything between [[ ]]
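
Since that pattern also removes ordinary links, a narrower variant may fit the Category and Taerel case better. This is a sketch on a made-up sample line, and the prefix list is an assumption based on the two link types named in this thread.

```python
import re

text = "The [[A'voreld]] clan lives in [[Taerel:North]].\n[[Category:Clans]]\n"

# Removes every [[...]] link, including ordinary ones.
everything = re.sub(r"\[\[.*?\]\]", "", text)

# Removes only links that start with "Category:" or "Taerel:".
only_prefixed = re.sub(r"\[\[(?:Category|Taerel):[^\]\n]*\]\]", "", text)
```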

smoky tinsel
# quaint hornet that will remove everything between [[ ]]

  File "/home/demon/Desktop/WikiToolsDoNotDelete/HeadingRemoveTestbed.py", line 12, in <module>
    pattern_regex = re.compile(pattern)
  File "/usr/lib/python3.10/re.py", line 251, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python3.10/re.py", line 303, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.10/sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.10/sre_parse.py", line 955, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.10/sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.10/sre_parse.py", line 843, in _parse
    raise source.error("missing ), unterminated subpattern",
re.error: missing ), unterminated subpattern at position 0

quaint hornet
#

did you remove the first parenthesis
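
The error message says the pattern begins (position 0) with an opening parenthesis that is never closed, which is what happens when only the closing one is removed. A small sketch with shortened, made-up patterns and a hypothetical helper `compiles`:

```python
import re


def compiles(pattern):
    """Return True if the pattern is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


balanced = r"^(==History==|==Culture==)\n"          # both parentheses present
no_group = r"==History==|==Culture=="               # both parentheses removed
only_closing_removed = r"(==History==|==Culture=="  # opening one left behind
```

Both `balanced` and `no_group` compile; only the pattern with the leftover opening parenthesis raises `re.error`.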

smoky tinsel
quaint hornet
#

paste the regex

smoky tinsel
quaint hornet
#

@smoky tinsel ok i gotta go, i can help out tomorrow if you need

quaint hornet
#

@verbal ginkgo hey sorry i probably cant look at it until tomorrow, im busier than i thought id be

quaint hornet