Most machine learning algorithms require the input data to be a numeric matrix, where each row is a sample and each column is a feature. This makes sense for continuous features, where a larger number obviously corresponds to a larger value (features such as voltage, purchase amount, or number of clicks). How to represent categorical features is less obvious. Categorical features (such as state, merchant ID, domain name, or phone number) don’t have an intrinsic ordering, and so most of the time we can’t just represent them with random numbers. Who’s to say that Colorado is “greater than” Minnesota? Or DHL “less than” FedEx? To represent categorical data, we need to find a way to encode the categories numerically. There are quite a few ways to encode categorical data. We can simply assign each category an integer randomly (called label encoding). Alternatively, we can create a new feature for each possible category, and set the feature to be 1 for each sample having that category, and otherwise set it to be 0 (called one-hot encoding).
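The two encodings described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation; the `states` list is a made-up example, and real pipelines would typically use a library such as pandas or scikit-learn instead.

```python
# Toy categorical column (hypothetical example data).
states = ["CO", "MN", "CO", "TX"]

# Label encoding: assign each distinct category an integer.
# The mapping here is by sorted order, but any arbitrary
# assignment works -- which is exactly why the resulting
# ordering ("CO" < "MN" < "TX") carries no real meaning.
label_map = {s: i for i, s in enumerate(sorted(set(states)))}
encoded = [label_map[s] for s in states]
print(encoded)   # [0, 1, 0, 2]

# One-hot encoding: one 0/1 column per distinct category,
# set to 1 only for the sample's own category.
categories = sorted(set(states))
one_hot = [[1 if s == c else 0 for c in categories] for s in states]
print(one_hot)   # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Note the trade-off this makes visible: label encoding keeps a single column but invents a spurious ordering, while one-hot encoding avoids the ordering at the cost of one new column per category, which can blow up for high-cardinality features such as merchant IDs or phone numbers.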