Forum name: General Discussion Archives
Topic subject: RE: this is bananas, as usual. you basically just tweaked that
Topic URL: http://board.okayplayer.com/okp.php?az=show_topic&forum=18&topic_id=133698&mesg_id=133716
133716, RE: this is bananas, as usual. you basically just tweaked that
Posted by Triptych, Sun Mar-15-09 11:08 PM
>lessonizing script you had? (that allowed you to show "just
>the links, m'am" on PAMS type joints?

Actually this was ALL new code. The other projects only looked at a thread's meta information, like authorship, replies, views, etc. This code had to actually make sense of dcforums' awful, awful, awful HTML and parse out each post as well as the thread's tree structure.

This was also an interesting case because the post was so fucking huge. There are basically two options if you want to use an existing library to parse HTML. You can use slow, memory-hungry libraries to parse crappy HTML, or you can use super fast and efficient libraries to parse perfect HTML. This particular post completely broke my usual methods of parsing crappy HTML, and of course the fast methods weren't an option, since they only work on perfect HTML and dcforums' HTML is anything but.

So I had to create a completely custom, pure regular-expression solution (read: FAST) to pull out each reply's raw relevant data, and then recombine that into Post objects. Or something like that.

Right now for each reply in a post I'm trapping:
author (name, but not ID)
title
message
post num
parent num

Still need to get:
author id
time stamp
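
For anyone curious what that looks like, here is a minimal sketch of the regex-then-recombine idea in Python. It is not the actual code: the markup in SAMPLE is made up (the real dcforums HTML isn't shown here), and the names POST_RE, Post and parse_posts are hypothetical. It traps the five fields listed above and uses parent num to rebuild the tree.

```python
import re
from dataclasses import dataclass, field

# Made-up markup for illustration only; real dcforums HTML is far messier.
SAMPLE = """
<!-- msg 1 parent 0 --><b>ann</b><i>first</i><p>hello</p>
<!-- msg 2 parent 1 --><b>bob</b><i>RE: first</i><p>hi ann</p>
<!-- msg 3 parent 1 --><b>cy</b><i>RE: first</i><p>yo</p>
"""

# One pattern per reply; non-greedy groups keep each match inside its own reply.
POST_RE = re.compile(
    r"<!-- msg (?P<num>\d+) parent (?P<parent>\d+) -->"
    r"<b>(?P<author>.*?)</b>"
    r"<i>(?P<title>.*?)</i>"
    r"<p>(?P<message>.*?)</p>",
    re.DOTALL,
)

@dataclass
class Post:
    num: int
    parent: int
    author: str
    title: str
    message: str
    replies: list = field(default_factory=list)

def parse_posts(html):
    """Return {post num: Post}, with each Post attached to its parent's replies."""
    posts = {}
    for m in POST_RE.finditer(html):
        p = Post(int(m["num"]), int(m["parent"]),
                 m["author"], m["title"], m["message"])
        posts[p.num] = p
    for p in posts.values():
        parent = posts.get(p.parent)  # parent 0 has no entry, so it is a root
        if parent is not None:
            parent.replies.append(p)
    return posts
```

Calling parse_posts(SAMPLE) gives three Post objects, with posts 2 and 3 hanging off post 1. A single finditer pass never builds a DOM, which is the point of going regex-only on a huge thread.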

>but all of that is way easier said than done.

True

>this is crazy, though.

Thanks

>onliest thing you could do is link back (from your parsed out
>copies of ?uest's replies) to the original poast so that, if
>they wanted, folks could hop over there to add they lulz and
>whatnot. but that's really not necessary.

I'd need to trap the post id and thread id for that, I think. Probably not that hard. It looks like people are actually using this shit. If I see it getting linked to or anything I'll definitely do some upgrades.
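
A rough sketch of the link-back in Python, assuming the board's URLs always follow the pattern of this topic's own URL (az, forum, topic_id, mesg_id). The function name post_url is hypothetical, and whether every forum on the board uses the same parameters is an assumption.

```python
from urllib.parse import urlencode

BASE = "http://board.okayplayer.com/okp.php"

def post_url(forum, topic_id, mesg_id):
    """Build a link to one reply, mirroring the query string seen in the topic URL."""
    query = urlencode({
        "az": "show_topic",
        "forum": forum,
        "topic_id": topic_id,  # the thread id
        "mesg_id": mesg_id,    # the post id
    })
    return BASE + "?" + query
```

With forum 18, topic 133698 and message 133716 this reproduces the URL of this very post.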

>this is a great reader.
>
>the other thing you could do which would be hot would be to
>change the cell shading or text color based upon a timestamp
>date. i imagine you crawl the poast to catch any new updates.
>so that way, someone could look at this page, and see
>something in orange, and know that that's the equivalent of DJ
>CLUE screamin in they ears.

I really thought about a bunch of stuff like this - I wanted to offer different sorting options, so you could sort by recency (post number) or popularity (number of replies) or alphabetically (as it is now).
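
The three sort orders could be sketched like this in Python. The dict keys (title, num, reply_count) and the names SORT_KEYS and sort_replies are made up for illustration; the data is dummy data.

```python
# Dummy data: one dict per parsed reply.
REPLIES = [
    {"title": "Zebra talk", "num": 4, "reply_count": 0},
    {"title": "apple pie", "num": 9, "reply_count": 3},
    {"title": "Mango", "num": 2, "reply_count": 7},
]

# One key function per sort mode; negating the number puts the biggest first.
SORT_KEYS = {
    "recent": lambda r: -r["num"],           # highest post number first
    "popular": lambda r: -r["reply_count"],  # most replies first
    "alpha": lambda r: r["title"].lower(),   # case-insensitive A to Z
}

def sort_replies(replies, mode="alpha"):
    """Return a new list sorted by the chosen mode; the input is left alone."""
    return sorted(replies, key=SORT_KEYS[mode])
```

Keeping the key functions in a dict means adding another sort option is one more entry rather than another if-branch.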

Then I was gonna shade cells on a range from white to blue or something based on HOW popular or recent it was or whatever. But by the time I cracked through dcforums' horrible HTML I just felt like getting something up quickly.
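
For what it's worth, the white-to-blue shading is only a few lines. This is a sketch in Python, assuming a plain linear scale between the lowest and highest value on the page; the function name shade is hypothetical.

```python
def shade(value, lo, hi):
    """Map value in [lo, hi] to a hex color from white (lo) to pure blue (hi)."""
    if hi == lo:
        t = 0.0  # nothing to compare against, so leave the cell white
    else:
        t = (value - lo) / (hi - lo)
        t = max(0.0, min(1.0, t))  # clamp out-of-range values
    fade = int(255 * (1 - t))  # red and green drop toward 0 as t grows
    return "#{:02x}{:02x}ff".format(fade, fade)
```

The value can be reply count for popularity or post number for recency; the returned string drops straight into a table cell's background color.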