Data Science With Excel

Data science consists of several stages, one of which is data preparation. At this stage, dirty data is cleaned so that it is ready for modeling. Many applications make this kind of data processing easier. One of them is Microsoft Excel, which can process data so that it is ready for modeling. Excel has limits, however: a worksheet holds at most 1,048,576 rows and 16,384 columns. For data of no more than about 1 million rows, Excel can still handle data preparation with features such as error detection, duplicate removal, correction of erroneous values, outlier detection, missing-data handling, and data validation. This study presents some of these features along with examples of their use.


I. INTRODUCTION
In data science, the term data cleaning is closely tied to data quality. The quality of the data greatly affects the results of an analysis: no matter how good and up to date the analysis method is, poor-quality data will not produce satisfactory results. Data quality can be ensured through a procedure called data cleaning. Data cleaning is the process of ensuring the accuracy, consistency, and usability of the data in a data set [1] [2]. The key is to detect errors or corruption in the data and then correct or delete the affected data as needed. Combining multiple data sources can produce duplicate or mislabeled data. In such situations, data cleaning is also necessary to avoid more complex problems. Data cleaning is mandatory for several reasons. First, it eliminates the errors and inconsistencies that arise when multiple data sources are combined into a single data set. Second, it improves work efficiency, because developers and data processing teams can more easily find what they need in the data; a lower error rate also increases customer satisfaction and reduces the team's workload. Third, it helps developers map the different functions of the data, making them more familiar with its usefulness and its origin.

II. RESEARCH METHODS
The data cleaning methods carried out with Excel in this study are as follows [3].
Detecting Errors; The first step is to check for error or corruption notifications [4]. Errors are sometimes not visible while a document is being edited. They become noticeable, and annoying, only when the document is printed and an error value appears in the output. To find the cells that contain errors, there is no need to inspect every cell in the worksheet. They can be found with the Go To facility in Microsoft Excel.
To find cells that contain errors: Press F5 or CTRL+G. Then click the OK button; all error cells in the worksheet will be selected. The cells that contain errors can then be marked with a fill color, for example yellow.
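The same idea can be sketched outside Excel. The following minimal Python example assumes the worksheet has already been read into a list of rows and that error cells appear as Excel's standard error strings. The sample sheet and the function names are hypothetical and serve only as an illustration.

```python
# Error values that Excel can display in a cell (standard Excel error codes).
EXCEL_ERRORS = {"#DIV/0!", "#N/A", "#NAME?", "#NULL!", "#NUM!", "#REF!", "#VALUE!"}


def column_letter(index):
    """Convert a 0-based column index to an Excel column letter (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def find_error_cells(rows):
    """Return the A1-style addresses of all cells that hold an Excel error value."""
    addresses = []
    for r, row in enumerate(rows, start=1):
        for c, value in enumerate(row):
            if isinstance(value, str) and value.strip() in EXCEL_ERRORS:
                addresses.append(f"{column_letter(c)}{r}")
    return addresses


# Hypothetical worksheet contents, invented for illustration only.
sheet = [
    ["Item", "Price", "Qty", "Unit price"],
    ["Pen", 10, 2, 5.0],
    ["Ink", 12, 0, "#DIV/0!"],
    ["Pad", "n/a", 3, "#VALUE!"],
]
error_cells = find_error_cells(sheet)
print(error_cells)
```

The returned addresses play the role of the selected cells in the Go To procedure: they identify which cells would need to be highlighted or corrected.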
Removing Duplicate or Unnecessary Data; One Excel feature that can be used to find duplicate data is Conditional Formatting [5].
Conditional Formatting is a Microsoft Excel feature that formats cells according to the values they contain.
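As an illustration of what duplicate highlighting and removal amount to, the sketch below flags every value that occurs more than once and then keeps only the first occurrence of each. It is a minimal Python sketch, not Excel's implementation. The sample names are invented, and the normalisation step (trimming spaces and ignoring letter case) is an assumption made for this example.

```python
from collections import Counter


def normalise(value):
    """Trim spaces and ignore case for text (an assumption of this sketch)."""
    return value.strip().lower() if isinstance(value, str) else value


def flag_duplicates(values):
    """Return one flag per value: True if the value occurs more than once."""
    counts = Counter(normalise(v) for v in values)
    return [counts[normalise(v)] > 1 for v in values]


def remove_duplicates(values):
    """Keep the first occurrence of each value and drop later repeats."""
    seen = set()
    unique = []
    for v in values:
        key = normalise(v)
        if key not in seen:
            seen.add(key)
            unique.append(v)
    return unique


# Hypothetical column of names, invented for illustration only.
names = ["Ani", "Budi", "ani ", "Citra", "Budi"]
flags = flag_duplicates(names)
unique_names = remove_duplicates(names)
print(flags)
print(unique_names)
```

The flags correspond to the cells that a duplicate-highlighting rule would color, while the second function corresponds to deleting the repeated entries.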
The easiest way to find duplicate data in Excel is to use Conditional Formatting. Select the range of data in which you want to find duplicates. Then specify the desired cell format, as shown in Figure 5.
Fixing Structure Errors; Excel basically groups data into numeric and text values. By default, numeric values are aligned to the right of the cell and text to the left. Arithmetic operations can be performed on numeric data [6], but not on text. Phone numbers, for example, consist of digits, yet they cannot meaningfully be multiplied, divided, added, or subtracted. Phone numbers are therefore classified as text data. To fix data structure errors, two functions can be used: REPLACE and SUBSTITUTE.
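To make the roles of the two functions concrete, the sketch below imitates them in Python. REPLACE swaps characters at a given position, and SUBSTITUTE swaps occurrences of a given piece of text. The helper names and the sample phone number are hypothetical, and the SUBSTITUTE analogue omits Excel's optional instance argument.

```python
def excel_replace(text, start_num, num_chars, new_text):
    """Imitate REPLACE: swap num_chars characters starting at the 1-based position start_num."""
    start = start_num - 1
    return text[:start] + new_text + text[start + num_chars:]


def excel_substitute(text, old_text, new_text):
    """Imitate SUBSTITUTE without the optional instance argument: replace every occurrence."""
    return text.replace(old_text, new_text)


# Hypothetical phone number, invented for illustration only.
raw_phone = "+62 812-3456-789"

# Remove the separators, keeping the result as text so no digit is lost.
no_separators = excel_substitute(excel_substitute(raw_phone, "-", ""), " ", "")

# Replace the first three characters (the country prefix) with a leading zero.
local_format = excel_replace(no_separators, 1, 3, "0")

print(no_separators)
print(local_format)
```

Because the phone number stays a string throughout, the leading zero is preserved, which is the reason such data is treated as text rather than as a number.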
Filtering Unwanted Outliers; During data processing, some values may at first glance appear inconsistent with, or far away from, the rest of the data. Such values are called outliers [7]. It is acceptable to remove an outlier, provided a clear reason is given, because filtering outliers can improve the performance of the analysis. Even so, keep in mind that the appearance of an outlier does not mean that the theory being tested is wrong [8]. On the contrary, an outlier can serve as an indicator for assessing the validity of the data.
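The rule applied in this study's worksheet marks an observation with * when the absolute value of its standardized value exceeds 3. A minimal Python sketch of that rule follows. It assumes the standardized value is computed with the sample standard deviation, which the text does not state, and the data values are invented for illustration.

```python
from statistics import mean, stdev


def flag_outliers(values, threshold=3.0):
    """Mark each value with "*" when its absolute standardized value exceeds the threshold."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (an assumption of this sketch)
    flags = []
    for x in values:
        z = (x - centre) / spread  # standardized value
        flags.append("*" if abs(z) > threshold else "")
    return flags


# Hypothetical observations, invented for illustration only: 19 typical values and one extreme.
observations = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
                11, 9, 10, 12, 10, 9, 11, 10, 10, 100]
marks = flag_outliers(observations)
print(marks)
```

With small samples a single extreme value inflates the standard deviation considerably, so a threshold of 3 may never be reached. This is one reason to state clearly why a point is treated as an outlier.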
The following data are used to look for outliers that the data may contain. Next, calculate the absolute value of the standardized values obtained in steps 5 and 6 by typing the following formula in cell G6: =ABS(F6). Copy cell G6 and paste it into cells G7 through G25. To determine whether each sample or observation is an outlier, type the following formula in cell H6: =IF(G6>3,"*","").
Copy cell H6 and paste it into cells H7 to H25. Look at the results in cells H6:H25: if a * sign appears, the observation is an outlier.
Handling Missing Data; A missing value is the loss of part of the data that has been collected. In data science, missing values are closely related to data wrangling, which precedes data analysis and prediction [9] [10]. Data wrangling is the activity of standardizing or cleaning dirty (raw) data so that it is ready for analysis. Raw data here means data that still lacks a uniform format, contains missing values, or carries extra suffixes, prefixes, and the like. A data scientist typically spends 60% of their time on this process, since reportedly 75% of the data owned by companies is raw data. In Excel, missing values can be corrected using linear interpolation. The formula as entered in Excel is shown in Figure 11.
Figure 11. Interpolated position after searching using the formula
Next, enter the value for cell C6 with the formula C5 + B1 (the interpolation value found); the result is shown in Figure 12.
Figure 12. The process of inputting missing numbers by summing the smallest numbers with the interpolated values obtained
Data Validation; The last step of data cleaning is validation. Microsoft Excel can limit the values or text entered in a cell or range according to the desired criteria. For example, a cell can be restricted to the numbers 1-10, to a certain list of text, to a date format, and so on. For this kind of need, Microsoft Excel provides a feature called "Data Validation" [11].
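Returning briefly to the missing-data step described earlier in this section, the linear interpolation can be sketched in Python as follows. The exact formula of Figure 11 is not reproduced in the text, so the step used here (the difference between the two known neighbours divided by the number of intervals between them) is an assumption. Each missing value is then the previous value plus that step, analogous to C5 + B1. The sample series is invented.

```python
def fill_missing_linear(values):
    """Fill runs of None that lie between two known numbers with evenly spaced values."""
    filled = list(values)
    last_known = None  # index of the most recent known value
    for i, v in enumerate(filled):
        if v is None:
            continue
        if last_known is not None and i - last_known > 1:
            # Interpolation step: (end - start) / number of intervals in the gap.
            step = (v - filled[last_known]) / (i - last_known)
            for j in range(last_known + 1, i):
                # Each missing value is the previous value plus the step.
                filled[j] = filled[j - 1] + step
        last_known = i
    return filled


# Hypothetical series with missing entries, invented for illustration only.
series = [10, None, None, 16, 18, None, 22]
filled = fill_missing_linear(series)
print(filled)
```

Missing values at the very start or end of the series are left untouched in this sketch, because there is no second known value to interpolate towards.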
To use Data Validation to limit the contents of a cell or range: Select the cell or range for which you will set data validation.
Select the Data Validation menu in the Data Tools group on the Data tab. This step can also be done with the keyboard shortcut Alt + A + V + V.
After the Data Validation dialog box appears, set the validation settings that limit the contents of the desired cells.
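The kinds of restriction mentioned above (a whole number from 1 to 10, a fixed list of text, a date format) can be illustrated with a minimal Python sketch. The function names, the allowed list, and the date pattern are hypothetical choices made for this example and are not part of Excel.

```python
from datetime import datetime


def is_whole_number_between(value, low, high):
    """Accept only whole numbers within [low, high]; text and booleans are rejected."""
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return low <= value <= high


def is_in_list(value, allowed):
    """Accept only entries that appear in a fixed list of permitted text."""
    return value in allowed


def is_date(text, pattern="%Y-%m-%d"):
    """Accept only text that parses as a real calendar date in the given pattern."""
    try:
        datetime.strptime(text, pattern)
        return True
    except (ValueError, TypeError):
        return False


# Hypothetical entries, invented for illustration only.
print(is_whole_number_between(7, 1, 10))
print(is_whole_number_between(11, 1, 10))
print(is_in_list("Yes", ["Yes", "No"]))
print(is_date("2023-02-30"))
```

Each function corresponds to one restriction that could be configured for a cell: an entry is accepted only when the matching check returns True.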

III. RESULT AND ANALYSIS
As previously discussed, many factors can affect the results of an analysis in data science. One of the keys to success lies in the early stages, namely data preparation, which takes the longest and is the most difficult to do. Data preparation is the process of preparing data so that it is ready to be processed in the next stage [12]. Many problems found at the data preparation stage can hinder the successful implementation of data science. Examples are poor data quality (missing values, duplicate data, and incomplete data) and unbalanced data sets (some data is too dominant, so that predictions cannot be made correctly) [5] [7]. Errors at the data preparation stage will affect the results of the data science analysis. In addition to data preparation, the data exploration stage is no less complex. This paper focuses on the data validation process with the features in Excel.
Criteria Settings in Data Validation; As the visualization of the Data Validation menu in Figure 14 shows in more detail, there is an Allow menu. Clicking it brings up the validation criteria available in Excel. The eight comparison criteria are as follows.
between: Only data that lies between the two configured values can be entered in the cell.
not between: Only data that lies outside the range between the two configured values can be entered in the cell.
equal to: Only data equal to the configured value can be entered in the cell. Similar to the comparison operator "=".
not equal to: Only data that differs from the configured value can be entered in the cell. Similar to the comparison operator "<>".
greater than: Only data greater than the configured value can be entered in the cell. Similar to the comparison operator ">".
less than: Only data less than the configured value can be entered in the cell. Similar to the comparison operator "<".
greater than or equal to: Only data greater than or equal to the configured value can be entered in the cell. Similar to the comparison operator ">=".
less than or equal to: Only data less than or equal to the configured value can be entered in the cell. Similar to the comparison operator "<=".
Input Message in Data Validation;
The INPUT MESSAGE settings tab in the "Data validation" option box is used to set the message displayed when a validated cell is selected.
When an Input Message is set, Excel displays the specified information whenever the validated cell is active.

Figure 17. Input message if there is an error
If the "Show input message when cell is selected" option is checked, then whenever we select the validated cell, Excel automatically displays the message that we set in the Title and Input message fields. Conversely, to display no message during data input, leave the defaults, clear the Title and Input message fields, or uncheck the option.
The Title field holds the title of the message, while the Input message field holds the content of the message to be displayed. For example, if we set data validation for cell A1 with the input message set as above, then a message appears when cell A1 is selected, as shown in Figure 18.
Figure 18. Error Message that appears when incorrectly validating
Error Alerts in Excel Data Validation; The ERROR ALERT settings tab in the "Data Validation" dialog box is used to set a warning message that appears if the data entered in the validated cell does not match the settings.
Figure 19. Error message
By activating the Error alert setting, the user receives information or a warning message if the data entered does not match the applied data validation.
For example, if a cell is set to accept only numbers and text is then entered into it, this error warning message appears automatically. If the input complies with the enforced data validation, the message does not appear.
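The interplay between the comparison criteria and the error alert can be sketched in Python. The example below is a simplified model, not Excel's behavior in every detail: it assumes a numeric-only cell, and the function name, the alert title, and the alert text are hypothetical.

```python
import operator

# The six single-value criteria and their comparison operators.
SINGLE_VALUE_CRITERIA = {
    "equal to": operator.eq,
    "not equal to": operator.ne,
    "greater than": operator.gt,
    "less than": operator.lt,
    "greater than or equal to": operator.ge,
    "less than or equal to": operator.le,
}


def check_entry(value, criterion, first, second=None,
                alert_title="Invalid input",
                alert_message="The value does not match the validation rule."):
    """Return (accepted, alert); alert is None when the entry is accepted."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        accepted = False  # text entered into a numeric-only cell
    elif criterion == "between":
        accepted = first <= value <= second
    elif criterion == "not between":
        accepted = not (first <= value <= second)
    else:
        accepted = SINGLE_VALUE_CRITERIA[criterion](value, first)
    alert = None if accepted else f"{alert_title}: {alert_message}"
    return accepted, alert


# Hypothetical entries, invented for illustration only.
print(check_entry(5, "between", 1, 10))
print(check_entry("five", "between", 1, 10))
print(check_entry(3, "greater than or equal to", 3))
```

An accepted entry yields no alert, whereas a rejected entry yields the configured title and message, mirroring the way the error alert appears only when the input violates the rule.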
How to Clear Excel Data Validation; If data validation is no longer needed, it can be removed as follows: Select the cell whose data validation is to be removed. Select the Data Validation menu in the Data Tools group on the Data tab. Select Clear All and then click OK.

IV. CONCLUSION
The path to becoming a data scientist can start with studying data types and data transformation according to the purpose of the analysis. These stages can be carried out with various applications that facilitate the work and maintain data reliability. One application that can be used for data science activities is Excel. Many Excel features can be used for data analysis, especially during data preparation. Its functions, formulas, and menus can support a data scientist's work. For data with more than about 1 million rows, another application should be used, because an Excel worksheet can hold only 1,048,576 rows and 16,384 columns. Working in Excel with data beyond that size would be difficult for a data scientist.

ACKNOWLEDGMENT
We would like to thank Universitas 'Aisyiyah Surakarta (www.aiska-university.ac.id) through P3M for funding this research.