
Blog Authorship Corpus: over 600,000 posts from more than 19,000 bloggers. Source: https://www.kaggle.com/datasets/saurabhbagchi/sample-blog-corpus

README.md

Sample Blog Corpus

Each blog is presented as a separate file whose name encodes a blogger ID and the blogger's self-reported gender, age, industry, and astrological sign. (All blogs are labelled for gender and age, but for many the industry and/or sign is marked as unknown.) Every blogger in the corpus falls into one of three age groups:

8240 "10s" blogs (ages 13-17)
8086 "20s" blogs (ages 23-27)
2994 "30s" blogs (ages 33-47)
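
The age ranges above leave gaps (18-22 and 28-32), so a grouping helper has to handle ages that belong to no group. A minimal sketch of such a helper follows; the names `AGE_GROUPS` and `age_group` are illustrative and not part of this repository.

```python
from typing import Optional

# Age ranges exactly as listed in this README (upper bounds are inclusive).
# AGE_GROUPS and age_group are hypothetical names used for illustration.
AGE_GROUPS = {
    "10s": range(13, 18),  # ages 13-17
    "20s": range(23, 28),  # ages 23-27
    "30s": range(33, 48),  # ages 33-47
}


def age_group(age: int) -> Optional[str]:
    """Return the corpus age-group label, or None if the age is outside all groups."""
    for label, ages in AGE_GROUPS.items():
        if age in ages:
            return label
    return None
```

For example, `age_group(15)` gives `"10s"`, while `age_group(20)` gives `None` because ages 18-22 are not in the corpus.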

Each age group contains an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of common English words.
All formatting has been stripped, with two exceptions: individual posts within a single blog are separated by the date of the following post, and links within a post are replaced by the label urllink.
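
Because links survive only as the label urllink, that token can be counted as a feature or removed before further text processing. The sketch below assumes the label appears as a standalone lowercase word, as written above; `count_links` and `strip_links` are hypothetical helper names.

```python
import re

# Assumption: the placeholder is the standalone lowercase word "urllink".
URLLINK = re.compile(r"\burllink\b")


def count_links(text: str) -> int:
    """Count how many link placeholders a post contains."""
    return len(URLLINK.findall(text))


def strip_links(text: str) -> str:
    """Remove link placeholders and collapse the leftover whitespace."""
    without_links = URLLINK.sub(" ", text)
    return " ".join(without_links.split())
```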

Sample

        id gender  age              topic      sign          date                                               text
0  2059027   male   15            Student       Leo   14,May,2004             Info has been found (+/- 100 pages,...
1  2059027   male   15            Student       Leo   13,May,2004             These are the team members:   Drewe...
2  2059027   male   15            Student       Leo   12,May,2004             In het kader van kernfusie op aarde...
3  2059027   male   15            Student       Leo   12,May,2004                   testing!!!  testing!!!
4  3581210   male   33  InvestmentBanking  Aquarius  11,June,2004               Thanks to Yahoo!'s Toolbar I can ...
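
The date column in the sample uses a day,month-name,year layout such as 14,May,2004. A minimal sketch for turning those strings into `datetime.date` values is shown below. It assumes English month names and an English or C locale, which may not hold for every row; `parse_post_date` is a hypothetical helper, and unparseable values come back as `None` rather than raising.

```python
from datetime import date, datetime
from typing import Optional


def parse_post_date(raw: str) -> Optional[date]:
    """Parse dates like '14,May,2004'; return None when the value does not match."""
    try:
        # %B matches a full month name in the current locale (assumed English).
        return datetime.strptime(raw.strip(), "%d,%B,%Y").date()
    except ValueError:
        return None
```

Returning `None` keeps a bulk conversion from failing on a single malformed row; the caller can then decide whether to drop or inspect those rows.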

Creator

Old Monk (Owner)

License

CC0: Public Domain

Tags

Text, Religion and Belief Systems, NLP

File List (total items: 6)

Name              Last Commit                                  Size     Last Modified
data              cleaning ambiguous unicode chars                      9 months ago
notebooks         profiling notebook                                    9 months ago
.gitattributes    Initial commit                               79 B     9 months ago
.gitignore        profiling notebook                           6.7 KiB  9 months ago
README.md         docs: Clarify age group details in README    1.7 KiB  9 months ago
requirements.txt  profiling notebook                           84 B     9 months ago

