Documenting, Diagramming, and Critiquing the Pligg Database

Warning: This post shows how much of a geek I am

Recently, I have begun working with Pligg, which is a Web 2.0 content management system modeled after the popular site Digg.com. I am interested in helping this project as there are a few things I’d like to do with it.

To start learning how Pligg operates, I decided to look at the guts underneath, diagram the Pligg MySQL database, and see what functions are making calls to the Pligg tables. For whatever strange reason, I love MySQL databases, so this was a good place to start.

I then created a Pligg wiki page describing the database tables while linking to the database diagram image and to the original MS Visio file.

I documented the database which is currently in the Subversion trunk — post Beta 9.1 code.

Here are more detailed table descriptions, and have many comments in bold.

trackbacks
Makes sense to me. In libs/trackback.php we can probably change the User Agent on line 88 to pligg instead of meneame (already fixed).

What I don’t like is all the indexing going on here. I’m not sure if the indexing is justified. A lot of the indexes do have searches that utilize them, though.

Also, phpMyAdmin is telling me that “UNIQUE and INDEX keys should not both be set for column `trackback_link_id`”

Suggestion: Fix indexing per phpMyAdmin
messages
Uses kmessaging 3rd party tool. Naming conventions are different, but it’s fine since we’re just using kmessaging’s setup.

Pligg was not originally written by English-speaking natives, nor was kmessaging, so watch out for functions spelled “messege” instead of “message”.

Note that I’ve linked sender and receiver to the user_id table, but this is not a true link in the database sense. sender and receiver get entered into the messages table in a roundabout way, not through one MySQL call. But the link still exists because the code is checking that the user_id exists in the users table (do a search for the SendMessege function and you’ll see where it gets called… the /modules/messaging/ section).
config
Looks fine, but do we really need an INT type for var_id ? I’m doubting there will ever be 2 billion config options. SMALLINT would be fine (-32768 to 32768, or SMALLINT UNSIGNED 0 to 65535)

Suggestion: Change var_id to SMALLINT
categories
category_safe_name takes out spaces and other potentially dangerous stuff.

I was originally confused as to where categories get inserted. I see the insert call in admin_categories.php line 81, but this makes no sense to me:

$sql = “insert into `” . table_categories . “` (`category_name`) VALUES (‘new category’);”;

However, this is seen in dbtree.php
comments
Pretty straight-forward table. I’m not sure what comment_nick is for. It looks like it’s never used.

Not sure what comment_karma is for either. It’d be nice to understand the karma system better.

Also, I really don’t think comment_votes needs to be an int. I don’t see anyone getting 2 billion votes. Save space because there might be a lot of comments out there, make it a smallint.

Suggestions: a. Remove comment_nick b. Change comment_votes to smallint
formulas
New in post-Pligg 9.1 svn code. It looks like there’s going to be a new formulas module that you can put your own formulas into.
friends
Simply keep track of the friend_to and friend_from in the same row. Not sure why this needs to be a display width of 20 though.
links
Every link goes in here. Points to the author (the user_id from users), status, number of votes, and a whole ton of indexing which makes sense.

Note that the link_votes will contain number of votes, but more specific vote data will be kept in the votes table.

There is an index on link_url, link_url_title, link_title, link_content, and link_tags. I don’t see the reasoning behind that.

Suggestions: Consider removing the index on link_url, link_url_title, link_title, link_content, and link_tags
modules
A stand-alone table to keep track of your modules. See modules/modules_manage.php — it does an INSERT when you install, and an UPDATE when you disable or enable a module.
pageviews
This table is going to get a ton of traffic.

pv_id is an unsigned int.
pv_type seems to be either story or profile. Maybe some other stuff. Since this is a varchar, this is going to be horribly wasteful. Why not an enum for this column?
pv_page_id is either the user_id of the user if you’re looking at someone’s profile, or it’s the link_id if you’re looking at a story. Used by out.php, story.php, and user.php
pv_user_id is the id of the user who’s looking at this.

This is great data, but I’m concerned. Every time someone clicks on anything, a varchar(20) is created.

Suggestion: Use enum instead of VARCHAR(20) for pv_type
saved_links
Apparently, somewhere you’re able to save links for yourself. Looks like a type of bookmarking thing. Odd thing is, I can’t find where you actually do this in the site. The code is real basic though — if you add one of these “bookmarks”, user_add_remove_links.php will do it for you. Maybe it’s disabled on my site.
tags
editlink.php calls tags_insert_string, which inserts a tag into this table for each tag you put in. tag_link_id then points to the links table.

In editlink.php, tags_insert_string calls and puts the global $dblang into the tag_lang column. So it just takes on whatever language your site is. I don’t know if this is necessary. I also don’t see why we have an index on tag_lang and tag_date.

Note that this table is a bit different, in that tag_link_id isn’t really a unique primary key on its own. You can have multiple tags with one link_id. Thus, there can and will be multiple tag_link_id of the same integer.
totals
This table simply keeps track of the number of published stories and unpublished stories. See libs/html1.php — line 630 has the function totals_regenerate and totals_adjust_count that should help you see what it’s doing. These functions get called primarily by link.php
users
The users and links tables are the heart of it all. Everything references this table, and it’s clear to understand.

I have some concerns:
a. It gets called so often, and we’re using int(20). Is anyone expecting 2 billion users? Why a display width of 20? An unsigned mediumint will give you 16 million users and save you a byte every time, which could add up.

b. We are using varchar(20) for IP addresses. That is wasteful.

c. This table is huge. It gets referenced very often. Personally, I prefer to split it into two tables, users and userprofiles, so that we don’t call this massive thing all the time. However, as long as the SQL calls only call what we need, and we’re indexing the right stuff, we should be fine with this.

d. We can probably get rid of user_lang and any code associated to it since we killed the languages table.

Suggestions:
a. use MEDIUMINT instead of INT(20)
b. For IP addresses, I recommend using INT UNSIGNED, and then storing it with the INET_ATON() and INET_NTOA functions.
c. Remove user_lang column
votes
Anytime a vote is made, data goes in here, linking the date (vote_date), the link (vote_link_id), and the user (vote_user_id). vote_type is either ‘links’ or ‘comments’.

Suggestions:
a. Personally, I think that we’re creating extra overhead with vote_type. The index on vote_type, vote_link_id, vote_user_id, and vote_ip is going to get enormous.

Why not have a separate table for vote_comments and vote_links?

b. Another case of vote_ip = varchar(64). I again recommend INT UNSIGNED like in the users table.

Overall, this isn’t that complicated, and it works well. Let me know what you think of my suggestions, I’ve only designed one large website database on my own.

Next, I will investigate whether using InnoDB with Foreign Key constraints is a good idea rather than MyISAM, which is the current Pligg default.

I am now ready to start hacking away with some modules!

…and this is what I do in my spare time while you are watching too much TV.

2 Responses to "Documenting, Diagramming, and Critiquing the Pligg Database"

By Uncle A April 2, 2007 - 2:21 am

Thanks for the info. Now I have something to do instead of watching the boring Ohio state-Florida Game. I totally agree with your conclusions, especially using enum instead of VARCHAR(20). Go Buckeys oops I mean enum.

Uncle (the senior geek) A
By socialcmsbuzz.com March 25, 2008 - 5:50 pm

Documenting, Diagramming, and Critiquing the Pligg Database…

Recently, I have begun working with Pligg, which is a Web 2.0 content management system modeled after the popular site Digg.com. I am interested in helping this project as there are a few things I’d like to do with it….

Documenting, Diagramming, and Critiquing the Pligg Database

2 Responses to "Documenting, Diagramming, and Critiquing the Pligg Database"

By Uncle A April 2, 2007 - 2:21 am

By socialcmsbuzz.com March 25, 2008 - 5:50 pm

Leave a Reply Cancel reply