Stripping metadata, headings and categories out of dump parsed to .txt | Miraheze | Page 1

smoky tinsel Apr 13, 2023, 6:36 PM

#

Stripping metadata, headings and categories out of dump parsed to .txt

quaint hornet Apr 14, 2023, 9:35 AM

#

@smoky tinsel Could just be a problem with the regex. Maybe try pasting the xml and the regex into a tester like this and tweak it until it does what you want: https://regex101.com/

smoky tinsel Apr 14, 2023, 9:37 AM

#

quaint hornet @smoky tinsel Could just be a problem with the regex. Maybe try pasting ...

The text I am stripping from is plain text and very big

quaint hornet Apr 14, 2023, 9:41 AM

#

smoky tinsel The text I am stripping from is plain text and very big

What about just copying the part of the text that should be removed but isnt, into the tester, and see if you can figure it out that way?

smoky tinsel Apr 14, 2023, 9:41 AM

#

quaint hornet What about just copying the part of the text that should be removed but isnt, in...

I would prefer to be given a fixed version

quaint hornet Apr 14, 2023, 9:46 AM

#

smoky tinsel I would prefer to be given a fixed version

Could you paste a bit of the xml into pastebin which has a part that should be removed but isnt?

smoky tinsel Apr 14, 2023, 9:48 AM

#

It is a .txt file

#

Well, the result is

#

https://pastebin.com/Vmvc6NBW

Pastebin

Resalt if stripper - Pastebin.com

Pastebin.com is the number one paste tool since 2002. Pastebin is a website where you can store text online for a set period of time.

quaint hornet Apr 14, 2023, 9:50 AM

#

smoky tinsel Well, the result is

And all of that text should have been removed? Or just some of it?

smoky tinsel Apr 14, 2023, 9:50 AM

#

Stripper code

https://pastebin.com/fdHjgkkp

Stripper Code - Pastebin.com

smoky tinsel Apr 14, 2023, 9:57 AM

#

quaint hornet And all of that text should have been removed? Or just some of it?

All of it, for each page within the XML

quaint hornet Apr 14, 2023, 10:08 AM

#

@smoky tinsel ok well first of all, you need to escape characters. put a backslash before all the forward slashes

#

having such a long regex line is confusing. maybe make an array of regex objects which each match to one thing, and loop through the array to see if any match

smoky tinsel Apr 14, 2023, 10:23 AM

#

https://pastebin.com/yDUkJbum - with the backslashes

HeadingRemove.py - Pastebin.com

quaint hornet Apr 14, 2023, 10:25 AM

#

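import re

regexList = []

regexList.append(re.compile(r"Genetic Age: 3E 2950-3E 3000"))
regexList.append(re.compile(r"Awakening Age: 3E 3000-3E 3415"))
# ... add as many expressions as needed

# input_file and output_file are assumed to be file objects that are already open
for line in input_file:
  # regex.match() only matches at the start of the line;
  # use regex.search() instead to match anywhere in the line
  if any(regex.match(line) for regex in regexList):
    continue  # skip lines that match any expression
  output_file.write(line)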

#

i havent actually tried this code, but thats what im thinking

#

to make it easier to understand what the regex is doing

smoky tinsel Apr 14, 2023, 10:34 AM

#

To strip out the metadata with mwparserfromhell @quaint hornet

#

The current result of that is this: https://pastebin.com/V0qfdNsF

Sampe, chunk xml to txt - Pastebin.com

quaint hornet Apr 14, 2023, 10:39 AM

#

i think there are better ways to do this. probably an xml parser for python

smoky tinsel Apr 14, 2023, 10:39 AM

#

quaint hornet i think there are better ways to do this. probably an xml parser for python

Why not mwparserfromhell?

quaint hornet Apr 14, 2023, 10:40 AM

#

smoky tinsel Why not mwparserfromhell?

as far as i know, thats for taking wikitext and turning it into plain text. it doesnt have functions for removing metadata from an xml file

#

@smoky tinsel using a module like this: https://www.askpython.com/python/examples/python-xml-parser
you would grab only the sections of xml you want. then youd run those sections through the mwparser to get plain text
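
A minimal sketch of the "grab only the sections you want" step, using Python's standard `xml.etree.ElementTree`. The sample dump, the `local_name` helper and the `extract_pages` function are all made up for illustration. A real MediaWiki export has more elements per page, and its namespace version may differ, which is why the sketch compares local tag names only.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature dump; a real MediaWiki export has more elements per page.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>First page</title>
    <revision><text>Some '''wikitext''' here.</text></revision>
  </page>
  <page>
    <title>Second page</title>
    <revision><text>More wikitext.</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_pages(xml_string):
    """Return a list of (title, wikitext) pairs, one per <page> element."""
    root = ET.fromstring(xml_string)
    pages = []
    for page in root.iter():
        if local_name(page.tag) != "page":
            continue
        title, text = "", ""
        # Walk everything under this page and keep only title and text.
        for child in page.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        pages.append((title, text))
    return pages


pages = extract_pages(SAMPLE_DUMP)
for title, text in pages:
    print(title, "->", text)
```

Because the loop visits every `<page>` element, it handles all pages in the dump rather than only the first. Each wikitext string could then be handed to mwparserfromhell (for example `mwparserfromhell.parse(text).strip_code()`), which is a third-party package and is left out here so the sketch runs on the standard library alone.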

#

if you need help with doing all that, ill have to come back to it at a later time

smoky tinsel Apr 14, 2023, 11:32 AM

#

This code works, but only strips one page @quaint hornet

smoky tinsel Apr 14, 2023, 12:10 PM

#

got a working code, thank you for your help

quaint hornet Apr 14, 2023, 1:33 PM

#

smoky tinsel **got a working code, thank you for your help**

Oh great to hear!

smoky tinsel Apr 14, 2023, 1:33 PM

#

quaint hornet Oh great to hear!

Regex to match all content inside a template?

{{.*?}}) did not work

fleet sphinxBOT Apr 14, 2023, 1:34 PM

#

https://meta.miraheze.org/wiki/.*%3F?action=edit&redlink=1

quaint hornet Apr 14, 2023, 1:34 PM

#

Need to escape the brackets: \{\{.*?\}\}

smoky tinsel Apr 14, 2023, 1:38 PM

#

quaint hornet Need to escape the brackets: `\{\{.*?\}\}`

Still has the template stuff

quaint hornet Apr 14, 2023, 1:40 PM

#

What exactly are you trying to do?

smoky tinsel Apr 14, 2023, 1:40 PM

#

quaint hornet What exactly are you trying to do?

Remove the templates and content within the templates

quaint hornet Apr 14, 2023, 1:42 PM

#

I dont know why that wouldnt work. Whats an example of text that has a template you want to remove?

smoky tinsel Apr 14, 2023, 1:43 PM

#

<strong>MediaWiki has been installed.</strong>

Consult the [https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents User's Guide] for information on using the wiki software.

== Getting started ==

[https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Configuration_settings Configuration settings list]
[https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:FAQ MediaWiki FAQ]
[https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ MediaWiki release mailing list]
[https://www.mediawiki.org/wiki/Special:MyLanguage/Localisation#Translation_resources Localise MediaWiki for your language]
[https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Combating_spam Learn how to combat spam on your wiki]
{{Infobox:Kin'toni Clan
|Clan Name = A'voreld Kin'toni Clan
|Other Names =
|Parent Groups = N/A
|Descended Groups = Unknown
|Demonym = Unknown
|Languages = Aundi, Ryran
|Religions = Zyzr Worship
|Government Type = Unknown
|Population = Unknown
|Areas Controlled = [[Taerel:Rinfenus Volcano|Rinfenus Volcano]]
|Allies = Unknown
|Rivals = Unknown
|Date Founded = C 4E150
|Date Disbanded = N/A
}}

==History==

The A'voreld clan is a clan in the kin'toni'ati race. Created by kin'toni that ran from the Shattering that happened from 4E 100 to 4E 150, for they did not want to get involved in the wars that were spanning the world of Terael at the time. One reason for the Shattering was that a gigantic volcanic eruption in 4E 165 had resulted in the coldest winters in recorded history. Many of the planet's volcanoes began to erupt during that cold winter and did for over 10 years, blocking out the sun and killing a lot of plant life. This meant that other animals did not get food and could not live.

quaint hornet Apr 14, 2023, 1:44 PM

#

Isnt your code reading it line by line?

#

It wont match a template that spans multiple lines

smoky tinsel Apr 14, 2023, 1:44 PM

#

How can I make it match the multiple lines?

quaint hornet Apr 14, 2023, 1:46 PM

#

I think instead of going through each line and outputting what doesnt match the regex, you should just grab the entire text into a variable, then use regex to replace those matches with a blank string

#

https://pythonexamples.org/python-re-sub

# Replace template with nothing (argument order: pattern, replacement, string)
result = re.sub(r'\{\{.*?\}\}', '', xmlVariable)

#

And do that for each thing you want to remove
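
A rough sketch of "do that for each thing you want to remove": hold the whole text in one string and apply one small pattern per kind of item. The sample text and both patterns are assumptions chosen for illustration, not the patterns from the pastebin.

```python
import re

# Hypothetical sample; the real input would be the whole file read into one string.
text = """== Getting started ==
Some body text.
[[Category:Clans]]
More body text.
"""

# One small pattern per kind of thing to remove (illustrative, not exhaustive).
patterns = [
    # A heading line such as "== Title ==", including its line break.
    re.compile(r"^==+.*?==+[ \t]*\n", re.MULTILINE),
    # A line that consists only of a category link.
    re.compile(r"^\[\[Category:[^\]\n]*\]\][ \t]*\n", re.MULTILINE),
]

result = text
for pattern in patterns:
    # re.sub takes (pattern, replacement, string);
    # the compiled form is pattern.sub(replacement, string).
    result = pattern.sub("", result)

print(result)
```

Keeping the patterns in a list makes it easier to add or test one at a time than a single long alternation would.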

smoky tinsel Apr 14, 2023, 1:58 PM

#

quaint hornet And do that for each thing you want to remove

Did not work

#

Code

#

@quaint hornet

quaint hornet Apr 14, 2023, 2:01 PM

#

Oh i think .* doesnt match newlines. Also you still havent escaped the braces. Try that first

smoky tinsel Apr 14, 2023, 2:01 PM

#

What could be added to match newlines?

quaint hornet Apr 14, 2023, 2:01 PM

#

What did you actually change? It looks the same

smoky tinsel Apr 14, 2023, 2:02 PM

#

Escaping the braces does not work

quaint hornet Apr 14, 2023, 2:02 PM

#

[\s\S]*?

#

Youre still doing it line by line
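
A small sketch of why `.*?` misses a multi-line template and how `[\s\S]*?` or the `re.DOTALL` flag changes that. The sample string is made up. Both fixes only help when the regex sees the whole text at once rather than one line at a time.

```python
import re

# Hypothetical multi-line template standing in for a real infobox.
text = "before\n{{Infobox\n|Name = X\n}}\nafter\n"

# '.' stops at newlines by default, so a template spread over lines is not matched.
no_flag = re.sub(r"\{\{.*?\}\}", "", text)

# Option 1: [\s\S] matches any character, newlines included.
char_class = re.sub(r"\{\{[\s\S]*?\}\}", "", text)

# Option 2: re.DOTALL makes '.' match newlines too.
dotall = re.sub(r"\{\{.*?\}\}", "", text, flags=re.DOTALL)

print(no_flag == text)       # the default pattern left the text untouched
print(char_class == dotall)  # both options give the same result
print(repr(dotall))
```

One caveat: a lazy pattern like this stops at the first `}}`, so templates nested inside other templates would not be removed cleanly.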

smoky tinsel Apr 14, 2023, 2:03 PM

#

tried [\s\S]*? no change

quaint hornet Apr 14, 2023, 2:04 PM

#

Just do output = template_regex.sub("", input_file)

#

To use the whole file

#

Oh

#

Well i can help fix it later today

smoky tinsel Apr 14, 2023, 2:05 PM

#

quaint hornet Well i can help fix it later today

Ty

quaint hornet Apr 14, 2023, 3:13 PM

#

@smoky tinsel can you put the code you're using on pastebin? discord removes backslashes

smoky tinsel Apr 14, 2023, 3:20 PM

#

https://pastebin.com/DbY3X57W

Remove regrex - Pastebin.com

#

@quaint hornet

#

This is to strip out page headers, categories, templates, and so on

quaint hornet Apr 14, 2023, 3:27 PM

#

I think this should do it: https://pastebin.com/twiGxdvq

Regex Remover - Fixed - Pastebin.com

#

@smoky tinsel

smoky tinsel Apr 14, 2023, 3:30 PM

#

Ok

#

Traceback (most recent call last):
  File "/home/demon/Desktop/WikiToolsDoNotDelete/HeadingRemoveTestbed.py", line 11, in <module>
    input_file = re.sub(pattern, "", input_file)
  File "/usr/lib/python3.10/re.py", line 209, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object

@quaint hornet

quaint hornet Apr 14, 2023, 3:33 PM

#

um ok i edited it, try the new version

#

@smoky tinsel

smoky tinsel Apr 14, 2023, 3:35 PM

#

quaint hornet @smoky tinsel

Traceback (most recent call last):
  File "/home/demon/Desktop/WikiToolsDoNotDelete/HeadingRemoveTestbed.py", line 14, in <module>
    input_file = pattern_regex.sub("", input_file)
TypeError: expected string or bytes-like object

quaint hornet Apr 14, 2023, 3:35 PM

#

hmm

#

@smoky tinsel try now

smoky tinsel Apr 14, 2023, 3:40 PM

#

quaint hornet @smoky tinsel try now

Deletes all content, not just the template

quaint hornet Apr 14, 2023, 3:42 PM

#

what happens if you just run this:

with open("taereltxt.txt", "r") as input_file, open("output.txt", "w") as output_file:
    print(str(input_file))

smoky tinsel Apr 14, 2023, 3:42 PM

#

<_io.TextIOWrapper name='taereltxt.txt' mode='r' encoding='UTF-8'>

quaint hornet Apr 14, 2023, 3:43 PM

#

oh
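
The `TypeError` and the `<_io.TextIOWrapper ...>` output above point the same way: the regex was handed the file object, not the file's text. A minimal sketch of the difference, assuming a throwaway temporary file in place of `taereltxt.txt`:

```python
import os
import re
import tempfile

# Create a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("keep\n{{Template\n|a = b\n}}\nkeep too\n")
    path = tmp.name

template_regex = re.compile(r"\{\{.*?\}\}", re.DOTALL)

with open(path, "r", encoding="utf-8") as input_file:
    try:
        # Passing the file object itself reproduces the TypeError.
        template_regex.sub("", input_file)
        raised = False
    except TypeError:
        raised = True

with open(path, "r", encoding="utf-8") as input_file:
    contents = input_file.read()  # the whole file as a single str

cleaned = template_regex.sub("", contents)
os.remove(path)

print(raised)
print(repr(cleaned))
```

Calling `str(input_file)` does not help either, because it returns the wrapper's description and not the file contents. `input_file.read()` is what produces the text.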

smoky tinsel Apr 14, 2023, 3:44 PM

#

quaint hornet oh

Reverted to https://pastebin.com/DbY3X57W as that works, but does not remove templates

quaint hornet Apr 14, 2023, 3:45 PM

#

smoky tinsel Reverted to https://pastebin.com/DbY3X57W as that works, but does not remove te...

try now

smoky tinsel Apr 14, 2023, 3:45 PM

#

Try this code? https://pastebin.com/twiGxdvq

quaint hornet Apr 14, 2023, 3:46 PM

#

yeah

smoky tinsel Apr 14, 2023, 3:46 PM

#

quaint hornet yeah

It removed the templates, but not the headings

#

@quaint hornet This regex needs to work as well

quaint hornet Apr 14, 2023, 3:47 PM

#

@smoky tinsel ok try again

smoky tinsel Apr 14, 2023, 3:53 PM

#

quaint hornet @smoky tinsel ok try again

All that needs regexing out now is the [[Taerel: ]] links and the category stuff

fleet sphinxBOT Apr 14, 2023, 3:53 PM

#

https://meta.miraheze.org/wiki/Taerel:?action=edit&redlink=1

quaint hornet Apr 14, 2023, 3:54 PM

#

i can make something thats easier for you to add new things that need to be removed. but it'll have to be tomorrow probably, i have to go to work in like ten minutes

smoky tinsel Apr 14, 2023, 3:55 PM

#

quaint hornet i can make something thats easier for you to add new things that need to be remo...

Adding onto the regex is fine as well

quaint hornet Apr 14, 2023, 3:56 PM

#

remove the \n, thats not needed

#

and the parentheses around it all

smoky tinsel Apr 14, 2023, 3:57 PM

#

I just want to fix the category removal stuff

#

The rest works

quaint hornet Apr 14, 2023, 3:57 PM

#

then just add new things by adding for example |\[\[Category\]\] to the end of it

#

#

like that

smoky tinsel Apr 14, 2023, 3:58 PM

#

Category names and the content in [[Taerel: ]] links vary. The whole [[Category:X]] line and every [[Taerel:X]] link need removing

fleet sphinxBOT Apr 14, 2023, 3:58 PM

#

https://meta.miraheze.org/wiki/Category:X?action=edit&redlink=1
https://meta.miraheze.org/wiki/Taerel:X?action=edit&redlink=1

quaint hornet Apr 14, 2023, 3:59 PM

#

or |\[\[.*?\]\]

#

that will remove everything between [[ ]]
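
A short sketch comparing the broad `\[\[.*?\]\]` pattern with a narrower one limited to `Category:` and `Taerel:` links. The sample text, the page names and the narrower pattern are assumptions for illustration.

```python
import re

# Hypothetical sample text; the page names are made up for illustration.
text = (
    "The clan lives near [[Taerel:Rinfenus Volcano|Rinfenus Volcano]].\n"
    "[[Category:Clans]]\n"
    "See [[Other page]] as well.\n"
)

# Broad: removes every [[...]] link, whatever its prefix.
all_links = re.compile(r"\[\[.*?\]\]")

# Narrower: only links whose target starts with Category: or Taerel:.
prefixed_links = re.compile(r"\[\[(?:Category|Taerel):.*?\]\]")

broad = all_links.sub("", text)
narrow = prefixed_links.sub("", text)

print(broad)
print(narrow)
```

Either pattern can be appended to an existing alternation with `|`. The broad one also removes links that may be worth keeping, and both remove the link's display label along with the target.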

smoky tinsel Apr 14, 2023, 4:00 PM

#

quaint hornet that will remove everything between [[ ]]

  File "/home/demon/Desktop/WikiToolsDoNotDelete/HeadingRemoveTestbed.py", line 12, in <module>
    pattern_regex = re.compile(pattern)
  File "/usr/lib/python3.10/re.py", line 251, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python3.10/re.py", line 303, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.10/sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.10/sre_parse.py", line 955, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.10/sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.10/sre_parse.py", line 843, in _parse
    raise source.error("missing ), unterminated subpattern",
re.error: missing ), unterminated subpattern at position 0
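
The error says the pattern opens a `(` at position 0 that is never closed. One plausible cause, and this is only a guess because the actual pattern lives in the pastebin, is that the closing parenthesis was removed while the opening one stayed. A sketch that reproduces the message with a made-up pattern:

```python
import re

# Hypothetical pattern: the closing parenthesis is gone, the opening one is not.
broken = r"(\{\{.*?\}\}|\[\[.*?\]\]"

try:
    re.compile(broken)
    error_message = None
except re.error as exc:
    error_message = str(exc)

# Dropping the opening parenthesis as well makes the pattern compile.
fixed = re.compile(r"\{\{.*?\}\}|\[\[.*?\]\]", re.DOTALL)
cleaned = fixed.sub("", "a {{t}} b [[Category:X]] c")

print(error_message)
print(repr(cleaned))
```

Parentheses in a pattern have to be balanced, so the outer pair should be removed together or kept together.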