Blog Authorship Corpus: over 600,000 posts from more than 19,000 bloggers. Source: https://www.kaggle.com/datasets/saurabhbagchi/sample-blog-corpus
Sample Blog Corpus
Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age but for many, industry and/or sign is marked as unknown.) All bloggers included in the corpus fall into one of three age groups:
8240 "10s" blogs (ages 13-17)
8086 "20s" blogs (ages 23-27)
2994 "30s" blogs (ages 33-47)
For each age group, there is an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of common English words.
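The three age groups leave gaps (ages 18-22 and 28-32 are not represented), which is easy to overlook when bucketing the `age` column. A minimal sketch of the mapping follows; the function name `age_group` and the choice to return `None` for ages outside the listed ranges are illustrative assumptions, not part of the dataset.

```python
from typing import Optional

# Age ranges exactly as listed in this README. The dictionary name and the
# function below are illustrative helpers, not something shipped with the data.
AGE_GROUPS = {
    "10s": range(13, 18),  # ages 13-17
    "20s": range(23, 28),  # ages 23-27
    "30s": range(33, 48),  # ages 33-47
}


def age_group(age: int) -> Optional[str]:
    """Return the README's age-group label for `age`, or None if no group covers it."""
    for label, ages in AGE_GROUPS.items():
        if age in ages:
            return label
    return None


print(age_group(15))  # 10s
print(age_group(33))  # 30s
print(age_group(20))  # None (falls in the gap between groups)
```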
All formatting has been stripped, with two exceptions: individual posts within a single blogger's file are separated by the date of the following post, and links within a post are denoted by the label urllink.
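Because every link is reduced to the literal token urllink, the token can be counted as a rough per-post link feature or removed before tokenization. The sketch below assumes the token always appears as a standalone word; the helper names `count_links` and `strip_links` are hypothetical.

```python
import re

# Matches the placeholder as a whole word only. That it always appears
# whitespace-delimited is an assumption, not something the README states.
URLLINK = re.compile(r"\burllink\b")


def count_links(text: str) -> int:
    """Count how many link placeholders a post contains."""
    return len(URLLINK.findall(text))


def strip_links(text: str) -> str:
    """Remove link placeholders; note this also collapses all runs of whitespace."""
    without_links = URLLINK.sub(" ", text)
    return " ".join(without_links.split())


post = "Check this out urllink and also urllink here"
print(count_links(post))  # 2
print(strip_links(post))  # Check this out and also here
```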
Sample
| | id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|---|
| 0 | 2059027 | male | 15 | Student | Leo | 14,May,2004 | Info has been found (+/- 100 pages,... |
| 1 | 2059027 | male | 15 | Student | Leo | 13,May,2004 | These are the team members: Drewe... |
| 2 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | In het kader van kernfusie op aarde... |
| 3 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | testing!!! testing!!! |
| 4 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 11,June,2004 | Thanks to Yahoo!'s Toolbar I can ... |
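The `date` values in the sample use a comma-separated day, month name, and year (for example 14,May,2004), which generic date parsers may not recognize. A minimal sketch of a parser follows. It assumes English month names and an English or C locale for `%B`; since the sample includes a non-English post, some rows may not fit that assumption, so the hypothetical helper `parse_post_date` returns `None` instead of raising.

```python
from datetime import date, datetime
from typing import Optional


def parse_post_date(raw: str) -> Optional[date]:
    """Parse a 'day,MonthName,year' string; return None if it does not match.

    Assumes full English month names, which is what the sample rows show.
    """
    try:
        return datetime.strptime(raw.strip(), "%d,%B,%Y").date()
    except ValueError:
        return None


print(parse_post_date("14,May,2004"))   # 2004-05-14
print(parse_post_date("11,June,2004"))  # 2004-06-11
print(parse_post_date("not a date"))    # None
```

Returning `None` keeps unparseable rows visible so they can be counted or inspected, instead of aborting a full pass over the corpus.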
Creator
Old Monk (Owner)
License
CC0: Public Domain
Tags
Text, Religion and Belief Systems, NLP
File List (total items: 6)
data
notebooks
.gitattributes
.gitignore
README.md
requirements.txt