Pandas is a common library for data scientists. There are different ways to process a Pandas DataFrame, but some ways are more efficient than others. This article will show you 4 efficient ways to:
Let’s get started!
Imagine we want to assign new columns whose values depend on the values of existing columns. …
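As a minimal sketch of this idea, the example below uses pandas' `DataFrame.assign` to derive new columns from existing ones. The data and the column names (`price`, `quantity`, `total`, `is_large`) are hypothetical and chosen only for illustration; this is not necessarily the approach the article goes on to describe.

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})

# assign() returns a new DataFrame and leaves the original untouched.
# Each keyword becomes a new column. A callable receives the DataFrame
# built so far, so a later column can depend on an earlier new one.
result = df.assign(
    total=lambda d: d["price"] * d["quantity"],
    is_large=lambda d: d["total"] > 50,
)
print(result)
```

Because `assign` does not modify `df` in place, it chains well with other DataFrame methods.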
As a data scientist, it is important to make sure your functions work as expected. A good practice is to write a small function, then check it with a unit test.
Rather than trying to debug a big chunk of code, it is better to break your code down into smaller pieces and make sure the smaller pieces work.
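To make this concrete, here is a small sketch of one tiny function paired with one tiny test. The function `fahrenheit_to_celsius` is a made-up example, not taken from the article. The test uses plain `assert` statements in a function whose name starts with `test_`, which is the naming convention test runners such as pytest look for.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from Fahrenheit to Celsius."""
    return (temp_f - 32) * 5 / 9


def test_fahrenheit_to_celsius():
    # Known reference points: freezing and boiling points of water.
    assert fahrenheit_to_celsius(32) == 0
    assert fahrenheit_to_celsius(212) == 100


if __name__ == "__main__":
    # Running the file directly executes the test without any test runner.
    test_fahrenheit_to_celsius()
    print("all tests passed")
```

If the function later grows or changes, the test fails immediately and points at the one small piece that broke.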
Pytest is a framework that makes it easy to write small tests in Python. I covered the benefits of unit testing and how to get started with pytest here. In the last article, I showed how to:
While working with time series, the values in your dataset might be affected by holidays, the day of the week, and the number of days in the month. Is there a way to extract these features from your date feature with one line of code?
Yes! In this article, I will show you how to extract the three features mentioned above from the date column using two different Python libraries.
We will use the daily female births dataset. This dataset describes the number of daily female births in California in 1959.
Download the data using:
Datapane randomly assigns the time so you can ignore the time in the table. …
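To illustrate what the three date features look like, here is a rough sketch that uses only Python's standard library (`datetime` and `calendar`). It does not use the two libraries the article covers. The `HOLIDAYS` set and the `date_features` helper are hypothetical placeholders, and a real project would rely on a maintained holiday calendar.

```python
import calendar
from datetime import date

# Hypothetical holiday list for illustration only; a real project would use
# a maintained holiday calendar instead of a hand-written set.
HOLIDAYS = {date(1959, 1, 1), date(1959, 12, 25)}


def date_features(d: date) -> dict:
    """Return holiday flag, weekday name, and days in the month for a date."""
    return {
        "is_holiday": d in HOLIDAYS,
        "day_of_week": calendar.day_name[d.weekday()],
        # monthrange() returns (weekday of first day, number of days in month).
        "days_in_month": calendar.monthrange(d.year, d.month)[1],
    }


print(date_features(date(1959, 1, 1)))
```

The same function can be applied to every value of a date column to build the three feature columns.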
Data contains not only numbers but also text. Knowing how to process text fast will help you analyze your data faster and gain more insights from it.
Text processing does not need to be difficult. Wouldn’t it be great if all we needed to find the sentiment of a text, tokenize it, find word and noun phrase frequencies, or correct spelling was a single line of code? That is when TextBlob comes in handy.
TextBlob aims to provide access to common text-processing operations through a familiar interface. …
Sklearn is a great library with a variety of machine learning models that you can train on your data. But if your data is big, training might take a long time, especially when you experiment with different hyperparameters to find the best model.
Is there a way to train your machine learning model 150 times faster than with Sklearn, with minimal changes? Yes, you can do that with cuML.
Below is the chart that compares the time it takes to train the same model using Sklearn’s RandomForestClassifier and cuML’s …
As a data scientist, you might find yourself spending a lot of time on your computer and keyboard. If construction workers have their power tools, data scientists have their keyboard as their go-to tool. If you can find a way to use your tool more efficiently, you will save yourself a lot of time. That is why I am a big fan of keyboard shortcuts.
If you don’t use keyboard shortcuts frequently, it is time to switch it up. It might take a little time to master them, but once you do, you will find yourself coding much faster without reaching for a mouse between lines. If you save 30 minutes per day by using keyboard shortcuts, you can have 3.5 …
When putting your code into production, you will most likely need to deal with organizing the files of your code. It can be really time-consuming to read, create, and run many files of data. This article will show you how to automatically
These tricks have saved me a lot of time while working on my data science projects. I hope you will find them useful as well!
If we have multiple data files to read and process like…
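When several files need to be read the same way, one possible approach is to loop over a folder instead of naming each file by hand. The sketch below assumes a folder of CSV files and uses only the standard library; the helper `read_all_csv` and the file names `data1.csv` and `data2.csv` are hypothetical and created here only so the example is self-contained.

```python
import csv
import tempfile
from pathlib import Path


def read_all_csv(folder: Path) -> dict:
    """Read every .csv file in `folder` into a list of row dicts, keyed by file name."""
    tables = {}
    for path in sorted(folder.glob("*.csv")):
        with path.open(newline="") as f:
            tables[path.name] = list(csv.DictReader(f))
    return tables


# Build a throwaway folder with two small files so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("data1.csv", "data2.csv"):
        (folder / name).write_text("id,value\n1,10\n2,20\n")
    tables = read_all_csv(folder)

print(sorted(tables))
```

Adding a new file to the folder then requires no change to the reading code.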
As a data scientist, it is essential to have good coding practices because they make your code easier for coworkers to read and help you avoid confusion when you revisit it in the future. You want your code to be both clean and easy to understand.
Thus, it is essential to:
Instead of trying to pay attention to several small details in your code at once, why not use extensions to assist and remind you? …
Git is a powerful tool for version control. It allows you to go back and forth between different versions of your code without being afraid of losing the code you change. As a data scientist, you might not only want to control different versions of your code but also control different versions of your data for the same reason: you don’t want to lose the previous data when the data is changed.
But Git is not ideal for database version control, for two reasons:
git pull and it was a…
As a data scientist or programmer, it might take you a lot of time to process your data and train your model. It is inefficient to constantly check the training on your screen, especially when the training can take hours or days. Is there a way to get notified with Python when your training is done?
Yes, you can get notified by:
Each of these methods takes just 2 or 3 more lines of code. Let’s find out how to create your own system to get notified using these methods.
If you are doing something close to your computer while waiting for the training to be done, simply making a noise when the training is complete is a good enough solution. …
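One simple way to make a noise, sketched here with the standard library only, is to write the ASCII bell character to the terminal once the job returns. This is an assumption made for illustration, not necessarily the method the article uses, and whether the bell is audible depends on your terminal settings. The decorator `notify_when_done` and the function `train_model` are hypothetical names.

```python
import functools
import sys
import time


def notify_when_done(func):
    """Ring the terminal bell and print the elapsed time after `func` returns."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # "\a" is the ASCII bell character; whether it is audible depends
        # on the terminal's settings.
        sys.stdout.write("\a")
        print(f"{func.__name__} finished in {elapsed:.1f}s")
        return result
    return wrapper


@notify_when_done
def train_model():
    # Placeholder for a long-running training job.
    time.sleep(0.1)
    return "trained"


train_model()
```

Wrapping the training function in a decorator keeps the notification to a single extra line at the place where the function is defined.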