Sticking a Shovel
The first step is to visit the official TensorFlow website and read its ML for Beginners material.
TensorFlow, developed by Google, is the most popular ML library; it supports Python, Java, C++, and Go.
There is also Scikit-learn, another Python-oriented ML library. Its key strength is the large number of out-of-the-box algorithms. There is even a tutorial on detecting the language of a text, written with the help of Scikit-learn.
So, as an example, let's set ourselves a goal: to teach a model to detect the presence of a SQL injection in a text string.
First of All, Data Sets
This is a classification task: the algorithm receives data and assigns it to one of several categories.
Features are the entries in which the algorithm searches for patterns.
A label is the category that a particular set of features belongs to. It is crucial to remember that an input record can have several features but only one label.
We’ll use a supervised learning algorithm for that task. This means that the algorithm will receive both features and labels while learning.
In ML, data collection is priority #1 when solving any problem.
Take a CSV file with three types of data: random emails (20,000 records), random emails with SQL injections (20,000), and pure SQL injections (10,000).
Now the benchmark data should be read. The loading function returns an X list that contains the features, a Y list that contains a label for every feature entry, and a label_names list. The last one simply contains text definitions for the labels, which makes processing the results convenient.
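The loading step might look like the following sketch. The helper name `load_data`, the CSV column layout (text sample, then a numeric label index), and the label names are all assumptions; adapt them to the actual file.

```python
import csv


def load_data(path):
    """Read the benchmark CSV into features, labels, and label names.

    Assumes each row is "<text sample>,<label index>"; adjust the
    parsing to match the real file's layout.
    """
    # Hypothetical text definitions for the three categories.
    label_names = ["email", "email with SQL injection", "SQL injection"]
    X, Y = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for text, label in csv.reader(f):
            X.append(text)   # the feature: the raw string
            Y.append(int(label))  # the label: an integer category index
    return X, Y, label_names
```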
Next, the data should be divided into training and test sets. The train_test_split() function handles this: it shuffles the records and returns four sets of data, a training and a test set for both features and labels. (The article originally referenced cross_validation.train_test_split(); in current versions of Scikit-learn the function lives in sklearn.model_selection.)
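A minimal sketch of the split; the tiny in-line lists stand in for the real features and labels, and `random_state` is only there to make the shuffle reproducible:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real features (X) and labels (Y).
X = ["hello", "SELECT 1", "bye", "DROP TABLE users", "hi", "' OR 1=1 --"]
Y = [0, 1, 0, 1, 0, 1]

# Shuffles the records and returns four sets of data:
# training and test sets for both features and labels.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=42
)
```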
Then we initialize the vectorizer object, which reads the transferred data character by character, combines the characters into N-grams, and translates them into numerical vectors that the ML algorithm can work with.
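A sketch of such a vectorizer using Scikit-learn's `CountVectorizer`; the character analyzer matches the description above, but the `(1, 3)` n-gram range is an assumption to tune for your data:

```python
from sklearn.feature_extraction.text import CountVectorizer

# analyzer="char" reads the data character by character;
# ngram_range=(1, 3) combines the characters into 1- to 3-grams,
# which are counted into the numeric vectors the algorithm expects.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 3))

# fit_transform learns the n-gram vocabulary and returns
# one row of counts per input string.
matrix = vectorizer.fit_transform(["' OR 1=1 --", "hello world"])
```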
Feeding the Data
The next step is to initialize the pipeline and pass it the previously created vectorizer together with the algorithm we will use to analyze our data set. Here we use the logistic regression algorithm.
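A minimal sketch of that pipeline; the step names are arbitrary, and the vectorizer settings repeat the assumptions from the previous step:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# The pipeline chains the character n-gram vectorizer with the
# logistic regression classifier: raw strings go in, labels come out.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("classifier", LogisticRegression()),
])
```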
The model is ready to digest the data. Now we pass the feature and label training sets to our pipeline, and the model begins training. On the next line, we run the features test set through the pipeline to get the predicted labels.
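Put together, the training and prediction steps might look like this; the tiny in-line lists stand in for the real training and test sets:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vectorizer", CountVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("classifier", LogisticRegression()),
])

# Toy stand-ins for the real training and test sets.
X_train = ["hello friend", "see you at noon", "' OR 1=1 --",
           "UNION SELECT password FROM users"]
Y_train = [0, 0, 1, 1]
X_test = ["good morning", "'; DROP TABLE users; --"]

pipeline.fit(X_train, Y_train)          # the model begins training
predictions = pipeline.predict(X_test)  # predicted labels for the test set
```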
If you want to know how correct the model's predictions are, you can compare the predicted labels with the labels test list.
The accuracy of the model is a value from 0 to 1 and can be converted into a percentage. This model returns the correct answer 100% of the time. Of course, with real data such a result won't be easily achieved; our task here is relatively simple.
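For the comparison itself, Scikit-learn's `accuracy_score` does the work; the label lists below are made up purely for illustration:

```python
from sklearn.metrics import accuracy_score

# Hypothetical predicted labels versus the known test labels.
Y_test = [0, 1, 1, 0, 1]
predictions = [0, 1, 1, 0, 0]

# A value from 0 to 1 that can be read as a percentage.
accuracy = accuracy_score(Y_test, predictions)
print(f"Accuracy: {accuracy:.0%}")  # 4 of 5 correct, i.e. 80%
```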
The final touch is to save the trained model so that it can be reused without re-training in any other Python program. We serialize the model into a pickle file using Python's built-in pickle module:
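A sketch of the serialization step; the file name `sqli_model.pkl` and the toy training data are assumptions:

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Train a small pipeline so there is something to serialize.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("classifier", LogisticRegression()),
])
pipeline.fit(["hello friend", "see you soon", "' OR 1=1 --", "UNION SELECT 1"],
             [0, 0, 1, 1])

# Dump the trained model so other programs can load it without re-training.
with open("sqli_model.pkl", "wb") as f:
    pickle.dump(pipeline, f)
```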
Here is a small demonstration of how to use the serialized model in another program.
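A self-contained sketch of the round trip: the first part trains and serializes a toy model, and the "other program" part deserializes it and classifies fresh strings. The file name, label names, and sample queries are all assumptions:

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# --- training program: build and serialize a toy model ---
pipeline = Pipeline([
    ("vectorizer", CountVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("classifier", LogisticRegression()),
])
pipeline.fit(
    ["hello friend", "meeting at noon", "' OR 1=1 --",
     "UNION SELECT password FROM users"],
    [0, 0, 1, 1],
)
with open("sqli_model.pkl", "wb") as f:
    pickle.dump(pipeline, f)

# --- the other program: deserialize the model and classify input ---
label_names = ["legitimate text", "SQL injection"]

with open("sqli_model.pkl", "rb") as f:
    model = pickle.load(f)

for query in ["please review the attached report",
              "'; DROP TABLE users; --"]:
    label = model.predict([query])[0]
    print(f"{query!r} -> {label_names[label]}")
```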
As you can see, the model confidently identifies SQL injection attacks.
Conclusion
As a result, we have a trained model that detects SQL injections. In theory, we could plug it into a backend so that, when an injection attack is detected, the malicious requests are redirected and no real vulnerability is exposed.
These are only the very first steps in the ML field. Our passion for ML and AI isn't a coincidence: one of our solutions, LS Intranet, already includes smart algorithms of this kind. To learn more about our artificial assistant for corporations, read this article.
Hopefully, those who are intimidated by programming will find this material interesting and be inspired to begin their own machine learning journey.