I am absolutely new to AI/ML and need some guidance/direction.

Every “New to AI, try this” guide I find ends up going down a path that isn’t right for the project I’m working on - or convoluted with so many terms I need to look up, I get rather frustrated. Maybe I’m too old to learn/use AI? Anyway . . .

This is my project, and any guidance, pointers, help would be super appreciated. I’m working on a job aggregator. I have a simple web crawler that goes to a url, fetches the HTML, cleans a lot of the text and structure, and outputs the content of the job posting.

I then go in manually, look at that simplified HTML and extract the actual job description (vs Company description, benefits, other stuff on a job posting) to be used in another database. I use the exact wording, straight copy and paste, no summarization or interpretation.

I have about 400 manually extracted data points in a database that look like this: job_site: “COMPANY_NAME”, raw_html: “<h1>Job Title</h1><p>This is what we do</p><p>We are looking for someone who</p>”, job_description: “We are looking for someone who”. I feel like I can use that as training data to do some form of text . . . extraction ?? . . . from an HTML document, but I don’t have any clue where to start.

  • @coolkicks
    1 year ago

    Couple of options to start out with: topic labeling and topic extraction.

    • Topic Labeling is a classic example of supervised learning, or using ML with training data to classify new observations based on patterns found in training data.

    • Topic Extraction is a classic example of unsupervised learning, or attempting to identify patterns without training data.

    I’m going to start with labeling, or classification. There are plenty of tools to train a model to classify text into categories; I’d recommend starting with this scikit-learn tutorial to see what’s involved before you dive in.

    With any classification problem, you need good training data. You mentioned you’ve scraped 400 job postings, and I’m assuming you would want to use the job description to predict the job title. Some quick math: you’ll want to withhold 30% of your data to test your model, so that leaves 280 postings to train on. I would recommend at least 100 descriptions per job title, so if you have 2-3 job titles, perfect, you’re ready to follow that tutorial with your own data!
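    The split-then-classify recipe above can be sketched with scikit-learn. The job descriptions and titles below are made-up placeholders (not real scraped data), but the pipeline — TF-IDF features feeding a linear classifier, with a 30% held-out test set — is the same shape as the tutorial’s:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical (description, title) pairs standing in for the 400 scraped postings.
descriptions = [
    "We are looking for someone who writes Python and builds APIs",
    "Seeking a frontend developer with React and CSS experience",
    "Backend engineer to design REST services in Python",
    "Frontend role: JavaScript, React, responsive design",
    "Python developer for data pipelines and APIs",
    "UI engineer comfortable with React components and styling",
] * 5  # repeated so the 30% test split still sees both classes
titles = ["backend", "frontend", "backend", "frontend", "backend", "frontend"] * 5

# Withhold 30% of the data for testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    descriptions, titles, test_size=0.3, random_state=0, stratify=titles
)

# TF-IDF features + a linear classifier: the basic supervised recipe.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out 30%
```

    With toy data this small the score means little; the point is the shape of the workflow: split, fit on the training portion, score on the withheld portion.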

    If you have more than that, you probably won’t be able to do labeling/classification here, and will instead want to do topic extraction, where you’ll throw your walls of text at the machine and let the machine tell you the patterns it finds.

    Topic modeling with spaCy and scikit-learn is a great overview of this process, and plugging your own data in is pretty straightforward.

    Neither of these examples really scratches the surface of what’s possible with text-based ML these days, but both are perfectly viable tools that run quickly on commodity hardware.

    • @LoopedcandleOP
      1 year ago

      Thanks for this! I’ll start learning!

      A friend mentioned I should start with a pre-trained model because 400 (and growing 50ish / week with my crawler) is just not nearly enough. Then do continued learning on that pre-trained model. Does that sound right?

      • @coolkicks
        1 year ago

        Yeah, model training is hard. Like capital-H HARD. You need a bunch of data, and it needs to be high quality.

        New York is the financial center of the US, so separating actual finance jobs from job postings that just happen to be written in New England vernacular is the kind of step you’ll need to go through to make sure your data is high enough quality.

        So if you are just starting out, use the 20 Newsgroups dataset from those links; it’s pretty good data with a ton of resources written about it. It’s not fun data, but it isn’t as likely to fall victim to biases you aren’t expecting.
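        The 20 Newsgroups dataset ships with scikit-learn, so getting started is a one-liner. A minimal sketch (the data downloads on first use, so a network connection is needed; the two categories chosen here are arbitrary, just to keep things small):

```python
from sklearn.datasets import fetch_20newsgroups

# Restrict to two categories, as the tutorials do before training a classifier.
train = fetch_20newsgroups(subset="train",
                           categories=["sci.space", "rec.autos"])
print(len(train.data), train.target_names)
```

        `train.data` is a list of raw posts and `train.target` their category labels, so it slots straight into the vectorizer/classifier pipeline from the tutorial.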