#Stripping metadata, headings and categories out of dump parsed to .txt
105 messages · Page 1 of 1 (latest)
@smoky tinsel Could just be a problem with the regex. Maybe try pasting the xml and the regex into a tester like this and tweak it until it does what you want: https://regex101.com/
The text I am stripping from is plain text and very big
What about just copying the part of the text that should be removed but isnt, into the tester, and see if you can figure it out that way?
I would prefer being given a fixed version
Could you paste a bit of the xml into pastebin which has a part that should be removed but isnt?
It is a .txt file
Well, the result is
Pastebin.com is the number one paste tool since 2002. Pastebin is a website where you can store text online for a set period of time.
And all of that text should have been removed? Or just some of it?
Stripper code
And all, for each page within the XML
@smoky tinsel ok well first of all, you need to escape characters. put a backslash before all the forward slashes
having such a long regex line is confusing. maybe make an array of regex objects which each match to one thing, and loop through the array to see if any match
https://pastebin.com/yDUkJbum - with the backslashes
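import re

regexList = []
regexList.append(re.compile(r"Genetic Age: 3E 2950-3E 3000"))
regexList.append(re.compile(r"Awakening Age: 3E 3000-3E 3415"))
# ... add as many expressions as needed
# input_file and output_file are assumed to be already-open text files
for line in input_file:
    # search() looks anywhere in the line; match() would only check the start
    if any(regex.search(line) for regex in regexList):
        continue  # drop lines that match any of the patterns
    output_file.write(line)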
i havent actually tried this code, but thats what im thinking
to make it easier to understand what the regex is doing
To strip out the metadata with mwparserfromhell @quaint hornet
The current result of that is this: https://pastebin.com/V0qfdNsF
i think there are better ways to do this. probably an xml parser for python
Why not mwparserfromhell?
as far as i know, thats for taking wikitext and turning it into plain text. it doesnt have functions for removing metadata from an xml file
@smoky tinsel using a module like this: https://www.askpython.com/python/examples/python-xml-parser
you would grab only the sections of xml you want. then youd run those sections through the mwparser to get plain text
if you need help with doing all that, ill have to come back to it at a later time
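A rough sketch of the "grab only the sections of xml you want" idea, using the standard library's xml.etree.ElementTree rather than the exact module from the link. The sample dump, the namespace version and the helper names (local_name, extract_pages) are assumptions made for illustration, not taken from the actual export:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature dump; a real MediaWiki export is much larger and its
# root element carries an xmlns that depends on the export version.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Main Page</title>
    <revision><text>MediaWiki has been installed.</text></revision>
  </page>
  <page>
    <title>A'voreld Kin'toni Clan</title>
    <revision><text>==History==
The A'voreld clan is a clan.</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_pages(xml_string):
    """Return (title, wikitext) for every <page>, ignoring all other metadata."""
    root = ET.fromstring(xml_string)
    pages = []
    for page in root.iter():
        if local_name(page.tag) != "page":
            continue
        title, text = "", ""
        for child in page.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        pages.append((title, text))
    return pages


pages = extract_pages(SAMPLE_DUMP)
for title, text in pages:
    print(title, "->", len(text), "characters of wikitext")
```

Each wikitext string could then go through mwparserfromhell.parse(text).strip_code() to get plain text; that step is left out here so the sketch runs without third-party packages.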
This code works, but only strips one page @quaint hornet
got a working code, thank you for your help
Oh great to hear!
Regex to match all content inside a template?
{{.*?}}) did not work
Need to escape the brackets: \{\{.*?\}\}
Still has the template stuff
What exactly are you trying to do?
Remove the templates and content within the templates
I dont know why that wouldnt work. Whats an example of text that has a template you want to remove?
<strong>MediaWiki has been installed.</strong>
Consult the [https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents User's Guide] for information on using the wiki software.
== Getting started ==
- [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Configuration_settings Configuration settings list]
- [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:FAQ MediaWiki FAQ]
- [https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ MediaWiki release mailing list]
- [https://www.mediawiki.org/wiki/Special:MyLanguage/Localisation#Translation_resources Localise MediaWiki for your language]
- [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Combating_spam Learn how to combat spam on your wiki]
{{Infobox:Kin'toni Clan
|Clan Name = A'voreld Kin'toni Clan
|Other Names =
|Parent Groups = N/A
|Descended Groups = Unknown
|Demonym = Unknown
|Languages = Aundi, Ryran
|Religions = Zyzr Worship
|Government Type = Unknown
|Population = Unknown
|Areas Controlled = [[Taerel:Rinfenus Volcano|Rinfenus Volcano]]
|Allies = Unknown
|Rivals = Unknown
|Date Founded = C 4E150
|Date Disbanded = N/A
}}
==History==
The A'voreld clan is a clan in the kin'toni'ati race. Created by kin'toni that ran from the Shattering that happened from 4E 100 to 4E 150, for they did not want to get involved in the wars that were spanning the world of Terael at the time. One reason for the Shattering was that a gigantic volcanic eruption in 4E 165 had resulted in the coldest winters in recorded history. Many of the planet's volcanoes began to erupt during that cold winter and did for over 10 years, blocking out the sun and killing a lot of plant life. This meant that other animals did not get food and could not live.
Isnt your code reading it line by line?
It wont match a template that spans multiple lines
How can I make it match the multiple lines?
I think instead of going through each line and outputting what doesnt match the regex, you should just grab the entire text into a variable, then use regex to replace those matches with a blank string
https://pythonexamples.org/python-re-sub
# Replace template with nothing
result = re.sub(r'\{\{.*?\}\}', '', xmlVariable)
And do that for each thing you want to remove
Did not work
Code
@quaint hornet
Oh i think .* doesnt match newlines. Also you still havent escaped the braces. Try that first
What could be added to match newlines?
What did you actually change? It looks the same
Escaped the braces does not work
tried [\s\S]*? no change
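For reference, a small sketch of why neither change helps while the text is still processed line by line. The sample text is made up; re.DOTALL and [\s\S] are standard Python regex features. Note that the non-greedy pattern stops at the first }}, so nested templates would need more care:

```python
import re

text = "Intro line\n{{Infobox\n|Clan Name = Example\n}}\n==History==\nBody text\n"

# Without DOTALL, '.' stops at a newline, so a multi-line template never matches.
single_line = re.compile(r"\{\{.*?\}\}")
# With DOTALL, '.' also matches '\n'; [\s\S] is an equivalent spelling.
multi_line = re.compile(r"\{\{.*?\}\}", re.DOTALL)
any_char = re.compile(r"\{\{[\s\S]*?\}\}")

unchanged = single_line.sub("", text)
stripped = multi_line.sub("", text)
stripped_alt = any_char.sub("", text)

# Line by line, no flag helps: no single line contains both '{{' and '}}'.
per_line = "".join(
    multi_line.sub("", line) for line in text.splitlines(keepends=True)
)

print(repr(stripped))
```

So the flag (or [\s\S]) only pays off once the regex is applied to the whole text as one string.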
Just do output = template_regex.sub("", input_file)
To use the whole file
Oh
Well i can help fix it later today
Ty
@smoky tinsel can you put the code you're using on pastebin? discord removes backslashes
@quaint hornet
This is to strip out page headers, categories, templates and so on
I think this should do it: https://pastebin.com/twiGxdvq
@smoky tinsel
Ok
Traceback (most recent call last):
  File "/home/demon/Desktop/WikiToolsDoNotDelete/HeadingRemoveTestbed.py", line 11, in <module>
    input_file = re.sub(pattern, "", input_file)
  File "/usr/lib/python3.10/re.py", line 209, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
@quaint hornet
Traceback (most recent call last):
  File "/home/demon/Desktop/WikiToolsDoNotDelete/HeadingRemoveTestbed.py", line 14, in <module>
    input_file = pattern_regex.sub("", input_file)
TypeError: expected string or bytes-like object
Deletes all content, not just the template
what happens if you just run this:
with open("taereltxt.txt", "r") as input_file, open("output.txt", "w") as output_file:
    print(str(input_file))
<_io.TextIOWrapper name='taereltxt.txt' mode='r' encoding='UTF-8'>
oh
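That printout is the cause of the TypeError: input_file is a file object, not a string. A minimal sketch of the fix, assuming throwaway temp files in place of taereltxt.txt and output.txt and a simple template pattern (both are stand-ins, not the code from the pastebin):

```python
import os
import re
import tempfile

template_regex = re.compile(r"\{\{.*?\}\}", re.DOTALL)

# Throwaway files standing in for taereltxt.txt and output.txt.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "input.txt")
out_path = os.path.join(workdir, "output.txt")
with open(in_path, "w", encoding="utf-8") as f:
    f.write("{{Infobox\n|Allies = Unknown\n}}\nThe clan text.\n")

with open(in_path, "r", encoding="utf-8") as input_file, \
        open(out_path, "w", encoding="utf-8") as output_file:
    try:
        template_regex.sub("", input_file)  # file object: raises TypeError
    except TypeError as err:
        print("passing the file object fails:", err)
    contents = input_file.read()  # a str holding the whole file
    output_file.write(template_regex.sub("", contents))

with open(out_path, encoding="utf-8") as f:
    result = f.read()
print(repr(result))
```

Calling .read() once loads the whole file into memory, which should be fine for a text dump of moderate size; a very large file might need to be processed page by page instead.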
Reverted to https://pastebin.com/DbY3X57W as that works, but does not remove templates
try now
Try this code: https://pastebin.com/twiGxdvq
yeah
It removed the templates, but not the headings
@quaint hornet This regex needs to work as well
pattern = r"^(==History==|==Stone Age: Before 1E 0==|==Copper Age: 1E 1-1E 2200==|==Bronze Age: 1E 2200-1E 4400==|==Iron Age: 2E 0-2E 700==|==Ancient Age: 2E 700-2200==|==Middle Ages: 3E 0-2050==|==Early Modern Age: 2050-3E 2600==|==Industrial Age: 3E 2600-3E 2700==|==Machine Age: 3E 2700-3E 2800==|==Atomic Age: 3E 2800-3E 2850==|==Space Age: 2E 2850-2E 2900==|==Information Age: 3E 2850-3E 2900==|==Genetic Age: 3E 2950-3E 3000==|Awakening Age: 3E 3000-3E 3415==|==Shattering Age: 4E 0 - 4E 250==|==History|Geography|Plants|Animals|Biology==|==Psychology==|==Culture==|==Government==|==Military==|==Religion==|==Miscellany==|==History (A'voreld kin'toni Clan)==|==Biology (A'voreld kin'toni Clan)==|==Psychology (A'voreld kin'toni Clan)==|==Culture (A'voreld kin'toni Clan)==|==Government (A'voreld kin'toni Clan)==|==Military (A'voreld kin'toni Clan)==|==Religion (A'voreld kin'toni Clan)==|==Miscellany (A'voreld kin'toni Clan)==|\sTaerel:.?|\sCategory:.?)\n"
@smoky tinsel ok try again
All that needs regexing out now is the [[Taerel: ]] links and the category stuff
i can make something thats easier for you to add new things that need to be removed. but it'll have to be tomorrow probably, i have to go to work in like ten minutes
Adding onto the regex is fine as well
then just add new things by adding for example |\[\[Category\]\] to the end of it
pattern = r"==History==|==Stone Age: Before 1E 0==|==Copper Age: 1E 1-1E 2200==|==Bronze Age: 1E 2200-1E 4400==|==Iron Age: 2E 0-2E 700==|==Ancient Age: 2E 700-2200==|==Middle Ages: 3E 0-2050==|==Early Modern Age: 2050-3E 2600==|==Industrial Age: 3E 2600-3E 2700==|==Machine Age: 3E 2700-3E 2800==|==Atomic Age: 3E 2800-3E 2850==|==Space Age: 2E 2850-2E 2900==|==Information Age: 3E 2850-3E 2900==|==Genetic Age: 3E 2950-3E 3000==|==Awakening Age: 3E 3000-3E 3415==|==Shattering Age: 4E 0 - 4E 250==|==Geography==|==Plants==|==Animals==|==Biology==|==Psychology==|==Culture==|==Government==|==Military==|==Religion==|==Miscellany==|==History \(A'voreld kin'toni Clan\)==|==Biology \(A'voreld kin'toni Clan\)==|==Psychology \(A'voreld kin'toni Clan\)==|==Culture \(A'voreld kin'toni Clan\)==|==Government \(A'voreld kin'toni Clan\)==|==Military \(A'voreld kin'toni Clan\)==|==Religion \(A'voreld kin'toni Clan\)==|==Miscellany \(A'voreld kin'toni Clan\)==|\sTaerel:.?|\sCategory:.?|\[\[Category\]\]"
like that
Category names and the content in [[Taerel: ]] vary. The whole [[Category:X]] line and every [[Taerel:X]] link need removal
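One way to cover varying category names and link targets, sketched under assumptions: the headings list is a shortened, hypothetical subset, strip_markup is an illustrative helper name, and this is not the code from the pastebin. re.escape keeps the parentheses in the clan headings literal, and the character class [^\]\n]* stands in for the varying X:

```python
import re

# Hypothetical subset of the headings; extend the list as needed.
headings = [
    "History",
    "Genetic Age: 3E 2950-3E 3000",
    "History (A'voreld kin'toni Clan)",
]

# re.escape keeps characters such as '(' and ')' literal.
heading_regex = re.compile(
    r"^==\s*(?:" + "|".join(re.escape(h) for h in headings) + r")\s*==[ \t]*\n?",
    re.MULTILINE,
)
# Whole [[Category:X]] line, whatever X is.
category_regex = re.compile(
    r"^[ \t]*\[\[Category:[^\]\n]*\]\][ \t]*\n?", re.MULTILINE
)
# Any [[Taerel:X]] or [[Taerel:X|label]] link, wherever it appears.
taerel_regex = re.compile(r"\[\[Taerel:[^\]\n]*\]\]")


def strip_markup(text):
    """Apply each pattern in turn to the whole text."""
    for regex in (heading_regex, category_regex, taerel_regex):
        text = regex.sub("", text)
    return text


sample = (
    "==History==\n"
    "The clan lives near [[Taerel:Rinfenus Volcano|Rinfenus Volcano]].\n"
    "==History (A'voreld kin'toni Clan)==\n"
    "More text.\n"
    "[[Category:Clans]]\n"
)
cleaned = strip_markup(sample)
print(cleaned)
```

This drops the link label along with the link. If the label should stay as plain text, the Taerel pattern would need a capture group and a replacement that keeps it.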
  File "/home/demon/Desktop/WikiToolsDoNotDelete/HeadingRemoveTestbed.py", line 12, in <module>
    pattern_regex = re.compile(pattern)
  File "/usr/lib/python3.10/re.py", line 251, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python3.10/re.py", line 303, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.10/sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.10/sre_parse.py", line 955, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.10/sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.10/sre_parse.py", line 843, in _parse
    raise source.error("missing ), unterminated subpattern",
re.error: missing ), unterminated subpattern at position 0
did you remove the first parenthesis?
Where?
paste the regex
at the start of the regex, remove the (
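A tiny sketch of that error, with shortened made-up patterns: an opening ( with no closing ) makes re.compile raise re.error, and removing it (or adding the matching parenthesis) lets the pattern compile. try_compile is a hypothetical helper, not part of the script being debugged:

```python
import re

broken = r"(==History==|==Culture=="  # opening '(' with no closing ')'
fixed = r"==History==|==Culture=="    # parenthesis removed


def try_compile(pattern):
    """Return the compiled pattern, or None if the regex is malformed."""
    try:
        return re.compile(pattern)
    except re.error as err:
        print(f"invalid pattern {pattern!r}: {err}")
        return None


broken_result = try_compile(broken)
fixed_result = try_compile(fixed)
```

Compiling the pattern once at the top of the script surfaces this kind of mistake before any file is touched.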
@smoky tinsel ok i gotta go, i can help out tomorrow if you need
@verbal ginkgo hey sorry i probably cant look at it until tomorrow, im busier than i thought id be
https://pastebin.com/twiGxdvq This will also remove all [[Category:...]] and [[Taerel:...]], and you can easily add more patterns to remove