
January 25, 2007

The Blog Authorship Corpus

The number of blog corpora is slowly increasing (motivating some discussion on standards for annotation?). Here is one that I was peripherally aware of earlier but have just rediscovered: the Blog Authorship Corpus.

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.  
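As a quick sanity check on the per-person averages quoted above, here is a minimal sketch in Python. The variable names are my own, and the word count is taken as exactly 140 million even though the description says "over 140 million", so the words-per-blogger figure is only a lower bound.

```python
# Figures quoted in the corpus description.
bloggers = 19_320
posts = 681_288
words = 140_000_000  # assumption: "over 140 million" treated as exactly 140M

# Simple averages per blogger.
posts_per_blogger = posts / bloggers
words_per_blogger = words / bloggers

print(f"{posts_per_blogger:.1f} posts per blogger")
print(f"{words_per_blogger:.0f} words per blogger (lower bound)")
```

Both results agree with the rounded figures in the description: roughly 35 posts and roughly 7,250 words per person.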


Moshe Koppel has done plenty of work in the area of authorial analysis. Note that this is a zero-barriers-to-entry corpus: just download it. It isn't very large, but it is certainly of value to researchers in the authorial analysis area.

 
