Data Engineers are often unsure exactly when to use Window Functions and when they are more appropriate than aggregate functions. The two main use cases for Window Functions are the following:
What is a window function
A window function performs a calculation across a set of rows in a table that are somehow related to the current row. This means they can be used for calculating running totals that incorporate the current row, or for ranking records across rows, inclusive of the current…
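As a minimal sketch of both uses, the snippet below runs a running total and a ranking through Python's built-in sqlite3 module. The `sales` table, its columns, and its values are invented for illustration, and the example assumes the bundled SQLite is version 3.25 or newer, which is when window functions were introduced.

```python
import sqlite3

# Hypothetical in-memory table; names and values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10), (2, 30), (3, 20)])

query = """
SELECT day,
       amount,
       -- running total: every row up to and including the current one
       SUM(amount) OVER (ORDER BY day) AS running_total,
       -- rank of the current row among all rows, largest amount first
       RANK() OVER (ORDER BY amount DESC) AS amount_rank
FROM sales
ORDER BY day
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Unlike an aggregate with GROUP BY, the query still returns one output row per input row; the window only controls which neighboring rows feed each calculation.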
CASE statements in SQL are very helpful when the result depends on a specific condition that can be applied to a column. A CASE statement helps to create a derived column, meaning you take existing columns and modify them. The CASE statement is the way to perform “IF” “THEN” logic in SQL. The most important tips for using CASE statements are the following:
What is Data Cleaning
Data cleaning is a foundational skill for a Data Scientist or Data Analyst. By definition, data cleaning is the process of cleaning up raw data to make it usable and ready for analysis. The following are the most common cases in which data cleaning needs to be performed.
Data preprocessing and normalization become very important when implementing different Machine Learning algorithms. Because preprocessing can significantly affect the outcome of the learned model, it is often important that all features are on the same scale. Normalization matters in algorithms such as k-NN, support vector machines, neural networks, and principal component analysis. The type of feature preprocessing and normalization that’s needed can depend on the data.
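To make "on the same scale" concrete, here is a minimal sketch of two widely used rescalings, min-max scaling and standardization (z-scores), in plain Python. The function names and the sample values are hypothetical, and these two methods are chosen only as illustrations; they are not necessarily the ones the post goes on to show.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to zero mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical feature column, e.g. heights in centimetres.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = min_max_scale(heights_cm)
zscores = standardize(heights_cm)
print(scaled)
print(zscores)
```

In practice the scaling parameters (minimum, maximum, mean, standard deviation) would be computed on the training data only and then reused for any new data.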
There are several different methods for data rescaling. The images below show the four most common ones that can be used in machine learning algorithms.
The first plot under original…
One of the tasks when building a supervised learning model, whether for classification or regression, is to create a model that makes correct predictions by learning from the training data. But the model will be useless if it cannot also make correct predictions on an unseen set of data. This ability to perform well on a held-out test set is the algorithm’s ability to generalize. But how do we know whether the trained model will generalize well, that is, whether it will be accurate on data it has not seen before?
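One common way to estimate generalization is to hold out part of the data and score the model only on that part. The sketch below assumes a toy dataset and a deliberately trivial threshold "model"; the helper names `train_test_split`, `fit_threshold`, and `accuracy` are hypothetical and written from scratch here, not taken from any library.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train_rows):
    """Toy model: midpoint between the largest class-0 x and the smallest class-1 x."""
    max_neg = max(x for x, y in train_rows if y == 0)
    min_pos = min(x for x, y in train_rows if y == 1)
    return (max_neg + min_pos) / 2

def accuracy(threshold, rows):
    """Fraction of rows whose label matches the thresholded prediction."""
    correct = sum(1 for x, y in rows if int(x > threshold) == y)
    return correct / len(rows)

# Invented data: the label is 1 exactly when x is at least 10.
data = [(x, int(x >= 10)) for x in range(20)]
train, test = train_test_split(data)
threshold = fit_threshold(train)
train_acc = accuracy(threshold, train)
test_acc = accuracy(threshold, test)
print(train_acc, test_acc)
```

The training accuracy shows how well the model fits the data it learned from; the test accuracy is the estimate of how it behaves on data it never saw.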
In general, ML makes the following assumptions about the data:
Over the past decades we have witnessed the power of visual information. As our lives become busier and more intense, we have less time to process and understand information. People often rely on visual information alone, without realizing how misleading it can be and how it can be used to shape a different opinion about the underlying information.
Famous writer and visual journalist Alberto Cairo recently published “The Truthful Art”, where he emphasizes the Five Qualities of Great Visualization, which he encourages everyone who deals with data visualization to keep in mind.
~ Is it truthful? That is, is it based on thorough and honest research?
Diving deep into the world of Big Data, we are always looking for better tools to explore, operate on, and modify data. Any tool we choose comes with advantages and disadvantages in the programming process. It is very important to understand when and how best to use Python tools.
Lists are one of the main built-in data structures in Python and can contain values of various data types. One of the most common operations on lists is the “for loop”, which can often be replaced with a list comprehension. …
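As a small illustration of that replacement, the sketch below builds the same list twice, once with a for loop and once with a list comprehension. The variable names and values are invented for the example.

```python
numbers = [1, 2, 3, 4, 5, 6]

# For-loop version: collect the squares of the even numbers.
squares_loop = []
for n in numbers:
    if n % 2 == 0:
        squares_loop.append(n * n)

# List-comprehension version: same result in a single expression.
squares_comp = [n * n for n in numbers if n % 2 == 0]

print(squares_comp)  # [4, 16, 36]
```

The comprehension keeps the transformation (`n * n`) and the filter (`if n % 2 == 0`) together in one place, which tends to be easier to read for short, simple loops.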