TOP SQL Interview Queries

- SELECT EMP_ID, NAME FROM EMPLOYEE_TBL WHERE EMP_ID = '0000' ;
- SELECT EMP_ID, LAST_NAME FROM EMPLOYEE
- WHERE CITY = 'Seattle' ORDER BY EMP_ID;
- SELECT COUNT(CustomerID), Country FROM Customers GROUP BY Country;
- SELECT SUM(Salary) FROM Employee WHERE Emp_Age < 30;
- SELECT AVG(Price) FROM Products;
- SELECT * FROM My_Schema.views;
- CREATE VIEW Failing_Students AS
- SELECT S_NAME, Student_ID
- FROM STUDENT
- WHERE GPA < 40;
- SELECT * FROM Failing_Students;
- CREATE OR REPLACE VIEW [Product List] AS
- SELECT ProductID, ProductName, Category
- FROM Products
- WHERE Discontinued = 'No';
- DROP VIEW V1;
- SELECT * FROM Sys.Objects WHERE Type = 'U'   -- user tables
- SELECT * FROM Sys.Objects WHERE Type = 'PK'  -- primary key constraints
- SELECT * FROM Sys.Objects WHERE Type = 'UQ'  -- unique constraints
- SELECT * FROM Sys.Objects WHERE Type = 'F'   -- foreign key constraints
- SELECT * FROM Sys.Objects WHERE Type = 'TR'  -- triggers
- SELECT * FROM Sys.Objects WHERE Type = 'IT'  -- internal tables
- SELECT * FROM Sys.Objects WHERE Type = 'P'   -- stored procedures
- UPDATE Customers SET Zip=Phone, Phone=Zip
- SELECT DISTINCT ID FROM Customers
- SELECT TOP 25 * FROM Customers WHERE Customer_ID IS NOT NULL;
- SELECT * From Customers WHERE Name LIKE 'Herb%'
- SELECT Customers.ID FROM Customers INNER
- JOIN Orders ON Customers.ID = Orders.ID
- SELECT phone FROM Customers
- UNION SELECT item FROM Orders
- SELECT Item AS item_description FROM Orders
- SELECT Item FROM Orders
- WHERE id = ALL
- (SELECT ID FROM Orders
- WHERE quantity > 50)
- /* The query below is commented out so it won't execute */
- /*
- SELECT item FROM Orders
- WHERE date = ALL (SELECT Order_ID FROM Orders
- WHERE quantity > 50)
- */
- /* The SQL query below will be executed,
- ignoring the text after "--" */
- SELECT item -- single-line comment
- FROM Orders -- another single-line comment
- WHERE id = ALL (SELECT ID FROM Orders
- WHERE quantity > 25)
- CREATE DATABASE AllSales
- CREATE TABLE Customers (
- ID varchar (80),
- Name varchar (80),
- Phone varchar (20),
- ....
- );
- ALTER TABLE Customers ADD Birthday varchar(80)
- DROP TABLE table_name
- CREATE TABLE Customers (
- ID int NOT NULL,
- Name varchar(80) NOT NULL,
- PRIMARY KEY (ID)
- );
- ID int NOT NULL AUTO_INCREMENT
- SELECT * FROM Customers
- SELECT Name FROM Customers WHERE EXISTS
- (SELECT Item FROM Orders
- WHERE Customers.ID = Orders.ID AND Price < 50)
- INSERT INTO Yearly_Orders
- SELECT * FROM Orders
- WHERE Date <= '2018-01-01'
- SELECT Item, Price *
- (QtyInStock + IFNULL(QtyOnOrder, 0))
- FROM Orders
- SELECT SUBSTRING_INDEX("www.bytescout.com", ".", 2);
- SELECT COALESCE(NULL, NULL, 'ByteScout', NULL, 'Byte')
- SELECT CONVERT(int, 27.64)
- SELECT eno,
- dno,
- salary,
- DENSE_RANK() OVER (PARTITION BY dno ORDER BY salary) AS ranking
- FROM employee;
ENO | DNO | SALARY | RANKING |
---|---|---|---|
7933 | 10 | 1500 | 1 |
7788 | 10 | 2650 | 2 |
7831 | 10 | 6000 | 3 |
7362 | 20 | 900 | 1 |
7870 | 20 | 1200 | 2 |
7564 | 20 | 2575 | 3 |
7784 | 20 | 4000 | 4 |
7903 | 20 | 4000 | 4 |
7901 | 30 | 550 | 1 |
7655 | 30 | 1450 | 2 |
7522 | 30 | 1450 | 2 |
7844 | 30 | 1700 | 3 |
7493 | 30 | 1500 | 4 |
7698 | 30 | 2850 | 5 |
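The next result set pairs each employee with the average salary computed over the whole table (AVG_SAL ≈ 2173.21). The query that produced it is not included above; a minimal sketch of the kind of window query involved (column aliases assumed):
- SELECT eno,
- dno,
- salary,
- AVG(salary) OVER () AS avg_sal
- FROM employee;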
ENO | DNO | SALARY | AVG_SAL |
---|---|---|---|
7364 | 20 | 900 | 2173.21428 |
7494 | 30 | 1700 | 2173.21428 |
7522 | 30 | 1350 | 2173.21428 |
7567 | 20 | 3075 | 2173.21428 |
7652 | 30 | 1350 | 2173.21428 |
7699 | 30 | 2950 | 2173.21428 |
7783 | 10 | 2550 | 2173.21428 |
7789 | 20 | 3100 | 2173.21428 |
7838 | 10 | 5100 | 2173.21428 |
7845 | 30 | 1600 | 2173.21428 |
7877 | 20 | 1200 | 2173.21428 |
7901 | 30 | 1050 | 2173.21428 |
7903 | 20 | 3100 | 2173.21428 |
7935 | 10 | 1400 | 2173.21428 |
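The following result set shows, for each employee, the previous salary within the same department (PREV), which is what LAG returns. The original query is not shown; a sketch of the likely form (column aliases assumed):
- SELECT dtno,
- eno,
- ename,
- job,
- sal,
- LAG(sal, 1, 0) OVER (PARTITION BY dtno ORDER BY sal) AS prev
- FROM employee;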
DTNO | ENO | ENAME | JOB | SAL | PREV |
---|---|---|---|---|---|
10 | 7931 | STEVE | CLERK | 1300 | 0 |
10 | 7783 | JOHN | MANAGER | 2450 | 1300 |
10 | 7834 | KING | PRESIDENT | 5000 | 2450 |
20 | 7364 | ROBIN | CLERK | 800 | 0 |
20 | 7876 | BRIAN | CLERK | 1100 | 800 |
20 | 7567 | SHANE | MANAGER | 2975 | 1100 |
20 | 7784 | SCOTT | ANALYST | 3000 | 2975 |
20 | 7908 | KANE | ANALYST | 3000 | 3000 |
30 | 7900 | JAMES | CLERK | 950 | 0 |
30 | 7651 | CONNER | SALESMAN | 1250 | 950 |
30 | 7522 | MATTHEW | SALESMAN | 1250 | 1250 |
30 | 7843 | VIVIAN | SALESMAN | 1500 | 1250 |
30 | 7494 | ALLEN | SALESMAN | 1600 | 1500 |
30 | 7695 | GLEN | MANAGER | 2850 | 1600 |
- SELECT eno,
- empname,
- job,
- salary,
- LEAD(salary, 1, 0) OVER (ORDER BY salary) AS salary_next,
- LEAD(salary, 1, 0) OVER (ORDER BY salary) - salary AS salary_diff
- FROM employee;
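The result set below does not show the LEAD columns from the query above; instead it lists each department's minimum salary (MIN_RESULT), so it most likely comes from a MIN window query of roughly this form (column aliases assumed):
- SELECT eno,
- empname,
- dtno,
- salary,
- MIN(salary) OVER (PARTITION BY dtno) AS min_result
- FROM employee;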
ENO | EMPNAME | DTNO | SALARY | MIN_RESULT |
---|---|---|---|---|
7782 | CLARK | 10 | 2450 | 1300 |
7839 | KING | 10 | 5000 | 1300 |
7934 | MILLER | 10 | 1300 | 1300 |
7566 | JONES | 20 | 2975 | 800 |
7902 | FORD | 20 | 3000 | 800 |
7876 | ADAMS | 20 | 1100 | 800 |
7369 | SMITH | 20 | 800 | 800 |
7788 | SCOTT | 20 | 3000 | 800 |
7521 | WARD | 30 | 1250 | 950 |
7844 | TURNER | 30 | 1500 | 950 |
7499 | ALLEN | 30 | 1600 | 950 |
7900 | JAMES | 30 | 950 | 950 |
7698 | BLAKE | 30 | 2850 | 950 |
7654 | MARTIN | 30 | 1250 | 950 |
- SELECT eno,
- empname,
- dtno,
- salary,
- MAX(salary) OVER () AS max_result
- FROM employee;
Interview Questions and Answers
Shell Scripts
1. Find Second Largest + Second Smallest and Add Them
#!/bin/bash
arr=(5 1 9 6 1 2 8)
sorted=($(printf "%s\n" "${arr[@]}" | sort -n | uniq))
second_smallest=${sorted[1]}
second_largest=${sorted[-2]}
sum=$((second_smallest + second_largest))
echo "Second Smallest: $second_smallest"
echo "Second Largest: $second_largest"
echo "Sum: $sum"
2. Find Unique Characters and Longest Substring Without Repetition
#!/bin/bash
input="abcabcbb"
longest=""
temp=""
for (( i=0; i<${#input}; i++ )); do
char="${input:i:1}"
if [[ "$temp" == *"$char"* ]]; then
temp="${temp#*$char}${char}"
else
temp="$temp$char"
fi
if [ ${#temp} -gt ${#longest} ]; then
longest="$temp"
fi
done
echo "Longest substring without repetition: $longest"
3. Reverse a String with Tests
reverseStr() {
local str="$1"
echo "$str" | rev
}
doTestsPass() {
local result=0
if [[ "$(reverseStr 'abcd')" == "dcba" ]]; then result=$((result + 1)); fi
if [[ "$(reverseStr 'odd abcde')" == "edcba ddo" ]]; then result=$((result + 1)); fi
if [[ "$(reverseStr 'even abcde')" == "edcba neve" ]]; then result=$((result + 1)); fi
if [[ "$(reverseStr "$(reverseStr 'no change')")" == "no change" ]]; then result=$((result + 1)); fi
if [[ "$(reverseStr '')" == "" ]]; then result=$((result + 1)); fi
if [[ $result -eq 5 ]]; then
echo "All tests pass"
else
echo "There are test failures"
fi
}
doTestsPass
4. Print Line 50 to 60
sed -n '50,60p' filename.txt
5. Search Pattern and Count Occurrence
grep -o "pattern" filename.txt | wc -l
6. Find Dot Product of Two Arrays
#!/bin/bash
a=(1 2 3)
b=(4 5 6)
dot_product=0
for i in "${!a[@]}"; do
dot_product=$((dot_product + a[i]*b[i]))
done
echo "Dot Product: $dot_product"
7. Check and Print Integers from a String
#!/bin/bash
str="abc123def45gh6"
echo "$str" | grep -o '[0-9]\+'
8. Find 3rd Least Array Value
#!/bin/bash
arr=(5 2 8 1 7 9)
sorted=($(printf "%s\n" "${arr[@]}" | sort -n | uniq))
echo "Third Least: ${sorted[2]}"
9. Check Process Status
#!/bin/bash
process="sshd"
if pgrep "$process" > /dev/null
then
echo "$process is running"
else
echo "$process is not running"
fi
10. Delete Blank Lines
sed -i '/^$/d' filename.txt
11. Lines Greater Than 5 Characters (awk)
awk 'length($0) > 5' filename.txt
12. Print Last 10 Lines
tail -n 10 filename.txt
13. Check Palindrome
#!/bin/bash
read -p "Enter a string: " str
rev_str=$(echo "$str" | rev)
if [[ "$str" == "$rev_str" ]]; then
echo "Palindrome"
else
echo "Not Palindrome"
fi
14. Check File Exists, Size, Send Email
#!/bin/bash
file="yourfile.txt"
if [[ -f "$file" ]]; then
size=$(stat -c%s "$file")
echo "File size: $size bytes"
if (( size > 0 )); then
echo "File exists and is non-empty" | mail -s "File Status" your@email.com
else
echo "File is empty" | mail -s "File Empty Alert" your@email.com
fi
else
echo "File not found" | mail -s "File Not Found" your@email.com
fi
AutoSys & Job Scheduling
On Hold vs On Ice
- On Hold: Job is manually stopped. Needs manual release.
- On Ice: Job is skipped and stays on ice until manually taken off ice; downstream jobs can still run as if it had succeeded.
Predecessor and Successor
- Predecessor: Job that must complete first.
- Successor: Job that runs after predecessor finishes.
Job Schedule Example (Cron)
0 10 * * 1-5 /path/to/your/script.sh
# Runs Monday to Friday at 10 AM
Migration Testing Points
- Validate job dependencies.
- Check environment variables.
- Check file watchers, time triggers.
- Validate calendars, schedules.
- Test job logs and outputs.
- Test email alerts.
Java Programming
Kill Process Tree
import java.util.*;
public class KillProcess {
public static void main(String[] args) {
int[] pid = {1, 3, 10, 5};
int[] ppid = {3, 0, 5, 3};
int kill = 5;
List<Integer> result = new ArrayList<>();
Map<Integer, List<Integer>> map = new HashMap<>();
for (int i = 0; i < ppid.length; i++) {
map.computeIfAbsent(ppid[i], k -> new ArrayList<>()).add(pid[i]);
}
Queue<Integer> queue = new LinkedList<>();
queue.add(kill);
while (!queue.isEmpty()) {
int curr = queue.poll();
result.add(curr);
if (map.containsKey(curr)) {
queue.addAll(map.get(curr));
}
}
System.out.println(result);
}
}
Find Odd and Even Numbers
int[] arr = {2, 3, 4, 5, 6, 3, 7};
for (int i = 0; i < arr.length; i++) {
    System.out.println(arr[i] + (arr[i] % 2 == 0 ? " is even" : " is odd"));
}
Character Frequency Count
String str = "abcdascgab";
Map<Character, Integer> map = new HashMap<>();
for(char c : str.toCharArray()) {
map.put(c, map.getOrDefault(c,0)+1);
}
map.forEach((k,v) -> System.out.println(k + " -> " + v));
Longest Occurrence of Character
public class LongestChar {
public static void main(String[] args) {
String s = "aaabbccccdde";
int maxLen = 0, currLen = 1, start = 0, maxStart = 0;
for (int i = 1; i < s.length(); i++) {
if (s.charAt(i) == s.charAt(i-1)) {
currLen++;
} else {
if (currLen > maxLen) {
maxLen = currLen;
maxStart = i - currLen;
}
currLen = 1;
}
}
if (currLen > maxLen) {
maxLen = currLen;
maxStart = s.length() - currLen;
}
System.out.println("Character: " + s.charAt(maxStart));
System.out.println("Starting Index: " + maxStart);
System.out.println("Length: " + maxLen);
}
}
Python Programs
Find Missing Characters to Form Pangram
import string
def missing_chars(input_str):
    input_str = input_str.lower()
    missing = [ch for ch in string.ascii_lowercase if ch not in input_str]
    return ''.join(missing)
print(missing_chars("the quick brown fox jumps over the dog"))
First Non-Repeating Character
def first_unique_char(s):
    for c in s:
        if s.count(c) == 1:
            return c
    return None
print(first_unique_char("apple"))
Unix, Git, CI/CD, SDLC Knowledge
Unix Commands
- cd: change directory
- pwd: print working directory
- ls: list files
- mkdir: create directory
- rm: remove files
- sed: text manipulation
- awk: pattern scanning and processing
CI/CD Pipeline Flow
- Code Checkout ➔ Build ➔ Test ➔ Deploy ➔ Monitor
- Tools: Git, Jenkins, Nexus, SonarQube
SDLC Phases
- Planning
- Analysis
- Design
- Development
- Testing
- Deployment
- Maintenance
Git Rebase
Reapplies a branch's commits on top of another base commit so the history stays linear.
Merge Conflict Resolution
- Pull changes.
- Edit conflicts manually.
- Mark resolved.
- Commit.
ALTER Keyword usage
Use | Syntax | Example |
---|---|---|
Rename table | ALTER TABLE old_table_name RENAME TO new_table_name; | ALTER TABLE employees RENAME TO staff; |
Add column | ALTER TABLE table_name ADD column_name datatype; | ALTER TABLE employees ADD salary DECIMAL(10,2); |
Drop column | ALTER TABLE table_name DROP COLUMN column_name; | ALTER TABLE employees DROP COLUMN salary; |
Rename column | ALTER TABLE table_name RENAME COLUMN old_column_name TO new_column_name; | ALTER TABLE employees RENAME COLUMN name TO full_name; |
Modify datatype | ALTER TABLE table_name MODIFY column_name new_datatype; (MySQL) | ALTER TABLE employees MODIFY salary BIGINT; |
Add primary key | ALTER TABLE table_name ADD PRIMARY KEY (column_name); | ALTER TABLE employees ADD PRIMARY KEY (employee_id); |
Drop primary key | ALTER TABLE table_name DROP PRIMARY KEY; | ALTER TABLE employees DROP PRIMARY KEY; |
Add foreign key | ALTER TABLE child_table ADD CONSTRAINT fk_name FOREIGN KEY (child_column) REFERENCES parent_table(parent_column); | ALTER TABLE orders ADD CONSTRAINT fk_customer FOREIGN KEY (customer_id) REFERENCES customers(id); |
Drop foreign key | ALTER TABLE table_name DROP FOREIGN KEY fk_name; | ALTER TABLE orders DROP FOREIGN KEY fk_customer; |
SQL and Database Interview Questions and Answers
1. What is normalization?
Normalization is the process of organizing data in a database to reduce redundancy and improve data integrity.
2. Difference between TRUNCATE and DELETE
TRUNCATE removes all rows instantly without logging individual row deletions and cannot be rolled back (DDL). DELETE removes rows one by one and can be rolled back (DML).
3. Explain DDL and DML
DDL (Data Definition Language) defines schema (CREATE, ALTER). DML (Data Manipulation Language) manipulates data (SELECT, INSERT, UPDATE, DELETE).
4. What is an index?
An index improves the speed of data retrieval operations on a table.
5. Explain joins
Joins combine rows from two or more tables based on a related column. Types: INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN.
6. Definition of SQL, DDL, JOIN
SQL is Structured Query Language. DDL defines schema structure (e.g., CREATE). JOIN combines rows from two tables.
7. What is a primary key and its advantages?
A primary key uniquely identifies each record in a table and ensures entity integrity, preventing NULLs and duplicates.
8. Write a query for employee-manager hierarchy
SELECT m.empname AS Manager, e.empname AS Employee
FROM employees e
JOIN employees m ON e.manager_id = m.employee_id;
9. How to sum two values without using functions?
SELECT col1 + col2 AS total FROM table_name;
10. What are constraints in SQL? Why are they used?
Constraints enforce rules on data columns: NOT NULL, UNIQUE, PRIMARY KEY, FOREIGN KEY, CHECK, DEFAULT.
11. Fetch duplicate records from a table
SELECT column1, COUNT(*) FROM table GROUP BY column1 HAVING COUNT(*) > 1;
12. Write SQL to get the third highest salary
SELECT DISTINCT salary FROM employee ORDER BY salary DESC LIMIT 1 OFFSET 2;
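An alternative that also handles salary ties, using a window function (standard SQL; assumes the same employee table as above):
SELECT DISTINCT salary
FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
      FROM employee) ranked
WHERE rnk = 3;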
13. Get manager name using self join
SELECT e.name AS Employee, m.name AS Manager
FROM employees e
JOIN employees m ON e.manager_id = m.emp_id;
14. What is a view?
A view is a virtual table based on a SQL query. It does not store data itself.
15. What is a materialized view? Difference from a normal view?
A materialized view stores the query result physically and is refreshed periodically, unlike a regular view which fetches data dynamically.
16. Can we update a view?
Yes, if the view is based on a single table without aggregations or GROUP BY.
17. Can we use DELETE without WHERE clause?
Yes, but it will delete all rows from the table. Use cautiously.
18. Reasons for SQL procedure delay?
Could include locks, missing indexes, large data volumes, suboptimal query plans, outdated statistics.
19. What are triggers?
Triggers are automatic actions fired in response to INSERT, UPDATE, or DELETE events on a table.
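A minimal sketch of a trigger using MySQL-style syntax; the employee_audit table and its columns are hypothetical:
CREATE TRIGGER trg_employee_audit
AFTER UPDATE ON employee
FOR EACH ROW
INSERT INTO employee_audit (emp_id, old_salary, new_salary, changed_at)
VALUES (OLD.emp_id, OLD.salary, NEW.salary, NOW());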
20. How to establish DB connection?
Use a connection string with hostname, port, database name, username, password, and driver details (JDBC/ODBC).
21. What is EXPLAIN PLAN?
EXPLAIN PLAN shows the execution path a SQL query will follow, helping to optimize performance.
22. Difference between DELETE, DROP, and TRUNCATE
DELETE removes specific rows, DROP removes the table itself, TRUNCATE removes all rows quickly without logging each removal.
23. Use of views?
Views simplify query complexity, enhance security, and help in modular database design.
24. Types of joins in Oracle?
Oracle supports INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, and old-style (+) joins.
25. Write query where empid = managerid
SELECT empname FROM employee WHERE empid = managerid;
26. What are JDBC parameters?
Include database URL, username, password, driver class, and optional settings like timeouts and SSL properties.
27. What is SQLException?
SQLException is thrown when there is an issue accessing the database, like syntax error, connection failure, constraint violation, etc.
All Interview Questions and Answers
Shell Script Questions
1. Find Second Largest and Second Smallest and Add
arr=(12 13 3 4 1 6 9 17 13)
unique_arr=($(echo "${arr[@]}" | tr ' ' '\n' | sort -n | uniq))
second_smallest=${unique_arr[1]}
second_largest=${unique_arr[-2]}
sum=$((second_smallest + second_largest))
echo "Sum: $sum"
2. Reverse a String
reverseStr() {
local str="$1"
echo "$str" | rev
}
doTestsPass() {
local result=0
if [[ "$(reverseStr 'abcd')" == "dcba" ]]; then result=$((result + 1)); fi
if [[ "$(reverseStr 'odd abcde')" == "edcba ddo" ]]; then result=$((result + 1)); fi
if [[ "$(reverseStr 'even abcde')" == "edcba neve" ]]; then result=$((result + 1)); fi
if [[ "$(reverseStr "$(reverseStr 'no change')")" == "no change" ]]; then result=$((result + 1)); fi
if [[ "$(reverseStr '')" == "" ]]; then result=$((result + 1)); fi
if [[ $result -eq 5 ]]; then
echo "All tests pass"
else
echo "There are test failures"
fi
}
doTestsPass
3. Find Unique Characters and Print Longest Possible Substring
input="tisishatrecupttt"
declare -A freq
for ((i=0; i<${#input}; i++)); do
char="${input:$i:1}"
freq[$char]=$(( ${freq[$char]} + 1 ))
done
unique=""
for c in $(echo ${!freq[@]}); do
if [[ ${freq[$c]} -eq 1 ]]; then
unique+=$c
fi
done
echo "Unique characters substring: $unique"
4. Print Line 50 to 60 of a File
sed -n '50,60p' filename.txt
5. Search Pattern and Count Frequency
grep -o 'pattern' filename.txt | wc -l
6. Shell Script: Palindrome Check
str="$1"
rev_str=$(echo "$str" | rev)
if [[ "$str" == "$rev_str" ]]; then
echo "Palindrome"
else
echo "Not Palindrome"
fi
7. Check File Exists and Size and Send Email
if [[ -f "$file" ]]; then
if [[ $(stat -c%s "$file") -gt 0 ]]; then
echo "File exists and not empty" | mail -s "File Alert" user@example.com
fi
fi
8. Delete Blank Lines in a File
sed -i '/^$/d' filename.txt
9. AWK: Print Lines Greater Than Length 5
awk 'length($0) > 5' filename.txt
10. Print Last 10 Lines
tail -n 10 filename.txt
AutoSys Related Questions
On Hold vs On Ice
- On Hold: Job will not run until manually released. Successor jobs do not run.
- On Ice: Job will not run but successors can run as if it succeeded.
JIL File
Job Information Language (JIL) is used to define AutoSys jobs with attributes such as command, machine, owner, start times, etc.
Predecessor/Successor Concept
Predecessor: Job that must finish before another can start.
Successor: Job that depends on the predecessor.
Job Schedule
Defined using "start_times", "days_of_week", "run_calendar" attributes in AutoSys.
Java and Python Coding
1. Find Odd and Even Numbers
int[] arr = {2, 3, 4, 5, 6, 3, 7};
for (int num : arr) {
if (num % 2 == 0) System.out.println(num + " Even");
else System.out.println(num + " Odd");
}
2. Kill Process Problem
Map<Integer, List<Integer>> tree = new HashMap<>();
for (int i = 0; i < ppid.length; i++) {
tree.computeIfAbsent(ppid[i], k -> new ArrayList<>()).add(pid[i]);
}
Queue<Integer> queue = new LinkedList<>();
queue.add(kill);
while (!queue.isEmpty()) {
int current = queue.poll();
result.add(current);
if (tree.containsKey(current)) {
queue.addAll(tree.get(current));
}
}
3. Find Frequency of Characters
String s = "abcdascgab";
Map<Character, Integer> freq = new HashMap<>();
for (char c : s.toCharArray()) {
freq.put(c, freq.getOrDefault(c, 0) + 1);
}
4. Remove Lines with Null 6th Field
BufferedReader reader = new BufferedReader(new FileReader("input.txt"));
BufferedWriter writer = new BufferedWriter(new FileWriter("output.txt"));
String line;
while ((line = reader.readLine()) != null) {
String[] fields = line.split(",");
if (fields.length >= 6 && fields[5] != null && !fields[5].isEmpty()) {
writer.write(line);
writer.newLine();
}
}
reader.close();
writer.close();
Conceptual Questions
Git
- Rebase: Reapply commits on top of another branch.
- Merge Conflict: Happens when two branches changed same line; resolve manually and commit.
Unix
- AWK: Text processing and extraction tool.
- Extract 20th line:
sed -n '20p' filename.txt
SDLC Process
Phases: Planning, Analysis, Design, Development, Testing, Deployment, Maintenance.
CI/CD Pipeline
Build - Test - Deploy automatically on code changes using tools like Jenkins, GitHub Actions, etc.
Logical Questions
1. Dot Product of Two Arrays
int[] a = {1, 2, 3};
int[] b = {4, 5, 6};
int result = 0;
for (int i = 0; i < a.length; i++) result += a[i] * b[i];
2. Print Integers from String
String input = "a1b2c3";
for (char c : input.toCharArray()) {
if (Character.isDigit(c)) System.out.print(c + " ");
}
AutoSys Job Status Cheat Sheet
What happens if a Box is on HOLD or ICE?
Action | Meaning | Effect on Box | Effect on Jobs inside |
---|---|---|---|
ON HOLD | Manual pause | Box stays READY, but jobs inside do NOT run | Child jobs are NOT allowed to start |
ON ICE | Freeze | Box is frozen, won't even evaluate jobs inside | Child jobs are NOT evaluated or started |
Box ON HOLD
The box remains active but child jobs do not run until the hold is released.
Box ON ICE
The box and child jobs are frozen. No evaluation happens. After removing ICE, only future events are considered.
AutoSys Job Statuses Explained
Status | Meaning | Description |
---|---|---|
INACTIVE | Idle | Job created, waiting for time/condition to trigger. |
ACTIVATED | Ready | Time/dependency met, waiting to run. |
STARTING | Launching | Starting process initiated but not running yet. |
RUNNING | Executing | Job is currently executing. |
SUCCESS | Completed | Job finished successfully. |
FAILURE | Failed | Job failed due to error. |
TERMINATED | Killed | Manually killed job. |
ON_HOLD | Paused | Manually put on hold; will not run. |
ON_ICE | Frozen | Completely frozen; not even evaluated. |
QUE_WAIT | Waiting for Queue | Waiting for machine queue availability. |
WAIT_REPLY | Wait for machine | Waiting for machine/agent reply. |
WAIT_START_TIME | Scheduled Wait | Waiting for start time. |
RESTART | Retry | Retrying job after failure. |
EVENT_WAIT | Waiting Event | Waiting for an external trigger/event. |
UNKNOWN | Lost | Agent disconnected or unreachable. |
How to Change Status (Common Commands)
sendevent -E JOB_ON_HOLD -J jobname – Put a job ON HOLD
sendevent -E JOB_OFF_HOLD -J jobname – Release HOLD from a job
sendevent -E JOB_ON_ICE -J jobname – Put a job ON ICE
sendevent -E JOB_OFF_ICE -J jobname – Remove ICE from a job
sendevent -E FORCE_STARTJOB -J jobname – Force start a job
sendevent -E KILLJOB -J jobname – Kill a job
If You See This, Then...
If you see... | It means... |
---|---|
INACTIVE | Job is waiting for schedule. |
ACTIVATED | Job ready to start. |
RUNNING | Job is executing. |
SUCCESS | Job completed successfully. |
FAILURE | Job failed, check logs. |
ON_HOLD | Needs manual release. |
ON_ICE | Frozen, manual intervention required. |
UNKNOWN | Lost communication, investigate machine/agent. |
Summary
HOLD = Pause but still aware of time/dependency.
ICE = Fully frozen, ignores schedules until manually released.
SUCCESS/FAILURE = Used to decide next steps in job chains.
PySpark Interview Preparation Guide
Day 1: PySpark Basics & Core Concepts
- What is PySpark: Python API for Apache Spark used for large-scale data processing.
- Spark Architecture: Consists of Driver, Executors, Cluster Manager.
- RDD vs DataFrame vs Dataset: RDD is the low-level API; DataFrame is optimized and easier to use; Dataset adds compile-time type safety but exists only in Scala/Java, not in PySpark.
- Transformations vs Actions: Transformations are lazy; Actions trigger computation.
- Lazy Evaluation: Optimization mechanism that delays execution until necessary.
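A minimal PySpark sketch of the Day 1 points above (lazy transformation vs. an action that triggers execution); the app name is arbitrary:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("day1-basics").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
doubled = rdd.map(lambda x: x * 2)           # transformation: lazy, only builds lineage
total = doubled.reduce(lambda a, b: a + b)   # action: triggers the actual computation
print(total)                                 # 30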
Day 2: RDD Operations & DataFrame API
- RDD operations: map, flatMap, filter, reduceByKey.
- DataFrame creation: from RDD or structured data.
- DataFrame methods: select, filter, groupBy, agg, withColumn, drop, cast, alias.
- File formats: CSV, JSON, Parquet reading and writing.
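A short sketch of the Day 2 DataFrame operations above; the CSV/Parquet paths and column names are placeholders, and `spark` is the SparkSession from the Day 1 sketch:
df = spark.read.csv("/data/sales.csv", header=True, inferSchema=True)
summary = (df.select("country", "amount")
             .filter(df.amount > 100)
             .groupBy("country")
             .agg({"amount": "sum"})
             .withColumnRenamed("sum(amount)", "total_amount"))
summary.write.mode("overwrite").parquet("/data/sales_summary")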
Day 3: Joins, UDFs & SQL in PySpark
- Join types: inner, left, right, outer joins.
- SQL queries: Registering temp views and running SQL on DataFrames.
- UDFs: Create custom transformation logic with User Defined Functions.
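A sketch combining the Day 3 topics (temp view, SQL, and a UDF); data and column names are made up, and `spark` is an existing SparkSession:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

sales = spark.createDataFrame([("US", 1200), ("IN", 800)], ["country", "amount"])
sales.createOrReplaceTempView("sales")
high_value = spark.sql("SELECT country, amount FROM sales WHERE amount > 1000")

label_udf = udf(lambda amt: "high" if amt > 1000 else "low", StringType())
labeled = sales.withColumn("amount_label", label_udf(sales.amount))
labeled.show()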
Day 4: Window Functions & Complex Operations
- Window Functions: row_number, rank, dense_rank, lead, lag.
- Partitioning: Use of partitionBy and orderBy in window specs.
- Pivot: Reshape DataFrame using pivot/unpivot operations.
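A window-function sketch for Day 4; the toy employee data and column names are assumptions, and `spark` is an existing SparkSession:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, dense_rank, lag

emp = spark.createDataFrame(
    [(1, "IT", 3000), (2, "IT", 4000), (3, "HR", 2500)],
    ["emp_id", "dept", "salary"])
w = Window.partitionBy("dept").orderBy("salary")
ranked = (emp.withColumn("row_num", row_number().over(w))
             .withColumn("dense_rnk", dense_rank().over(w))
             .withColumn("prev_salary", lag("salary", 1).over(w)))
ranked.show()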
Day 5: Performance Tuning & Optimization
- Catalyst Optimizer: Optimizes query plans in Spark SQL.
- Tungsten Engine: Handles memory and binary code optimization.
- Partitioning: Efficient data distribution using repartition and coalesce.
- Caching & Persistence: Store intermediate results in memory or disk.
- Broadcast Join: Used when one dataset is small enough to fit in memory.
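A Day 5 sketch showing repartitioning, caching, and a broadcast join; the Parquet paths and join key are placeholders, and `spark` is an existing SparkSession:
from pyspark.sql.functions import broadcast

big_df = spark.read.parquet("/data/transactions")
small_df = spark.read.parquet("/data/country_codes")

big_df = big_df.repartition(200, "country_code").cache()   # tune partitions, reuse in memory
joined = big_df.join(broadcast(small_df), on="country_code", how="left")  # broadcast the small side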
Day 6: PySpark with Machine Learning (MLlib)
- MLlib: Spark's machine learning library.
- Pipeline: Chain of Transformers and Estimators.
- VectorAssembler: Combine features into a single vector column.
- StandardScaler: Normalize features.
- Models: LinearRegression, LogisticRegression.
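A minimal MLlib pipeline sketch for Day 6; the feature columns, labels, and toy data are assumptions, and `spark` is an existing SparkSession:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression

train_df = spark.createDataFrame(
    [(25, 40000.0, 0.0), (42, 90000.0, 1.0), (35, 60000.0, 1.0)],
    ["age", "income", "label"])
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, scaler, lr]).fit(train_df)
predictions = model.transform(train_df)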
Day 7: Real-time Scenarios + Mock Interview
- Real-time Use Cases: Handling ETL, ingestion pipelines, and optimizations.
- Performance Bottlenecks: Identifying and resolving slow Spark jobs.
- Common Issues: Data skew, large joins, memory pressure.
- Mock Questions: End-to-end project explanation, tuning strategies, troubleshooting steps.
Use this guide to prepare thoroughly for PySpark interviews from basic to advanced levels. Each day is structured for progressive learning and hands-on practice.
PySpark Interview Questions and Answers
Question | Answer | Explanation |
---|---|---|
Which method below is used to create a temporary view on DataFrame? | DataFrame.createOrReplaceTempView("View Name") | This method registers a DataFrame as a temporary table for SQL queries in Spark. |
Which of the below Spark Core operations are wide transformations and result in data shuffling? | groupBy | groupBy triggers data shuffling across the network, which makes it a wide transformation. |
Which of the below option is correct to persist RDD only in primary memory? | rdd.persist(StorageLevel.MEMORY_ONLY) | This persists the RDD in memory only; if memory is not sufficient, it will recompute when needed. |
Which method can be used to verify the number of partitions in RDD? | RDD.getNumPartitions() | getNumPartitions() returns the number of partitions in an RDD. |
Which code snippet correctly converts dataset to DataFrame using namedtuple? | transDF = sc.textFile(...).map(...).map(lambda c: Cust_Trans(c[0], c[1], c[2], int(c[3]))).toDF() | This code splits the line, maps it to a namedtuple, and converts it to a DataFrame. |
Which of the below method(s) is/are Spark Core action operations? | collect(), foreach(), reduce() | These are Spark actions that trigger computation and return results. |
Which method is used to read a JSON file as a DataFrame? | sparkSessionObj.read.json("Json file path") | This is the standard way to load a JSON file using SparkSession. |
Which method is used to increase RDD partitions for better parallelism? | RDD.repartition(Number of partitions) | repartition increases partitions and involves shuffling for balanced distribution. |
Which transformation aggregates values by key efficiently in paired RDD? | reduceByKey() | reduceByKey performs local aggregation before shuffling, making it efficient. |
Which method is used to save a DataFrame as a Parquet file in HDFS? | DataFrame.write.parquet("File path") | This method saves a DataFrame in Parquet format to the given path. |
Which object acts as a unified entry point for Spark SQL including Hive? | SparkSession | SparkSession is the main entry point for DataFrame and SQL functionality. |
Which transformation can only be applied on paired RDDs? | mapValues() | mapValues transforms only values, keeping keys unchanged; used only on key-value RDDs. |
Which method is used to save DataFrame to a Hive table? | DataFrame.write.option("path", "/path").saveAsTable("Banking.CreditCardData") | saveAsTable persists the DataFrame as a Hive table; the "path" option stores the data at the given location (external table). |
Which of the following is a Spark Action? | collect(), first, take | These actions trigger execution and return results from the RDD/DataFrame. |
How to extract the first and third column from an RDD? | data1.map(lambda col: (col[0], col[2])) | Accessing elements by index allows extraction of specific columns from an RDD. |
What is the output of foldByKey with add on [('a',1), ('b',2), ('a',3), ('a',4)]? | [('a', 8), ('b', 2)] | foldByKey with 0 as initial value sums values grouped by key using add. |
What is the output of countByValue() on RDD with [(11,1), (1,), (11,1)]? | [((11, 1), 2), ((1,), 1)] | countByValue counts how many times each element occurs in the RDD. |
Which method displays contents of DataFrame as a collection of Row? | data1.collect() | collect() returns the content as a list of Row objects. |
Which object is created by the system in Spark interactive mode? | SparkSession | SparkSession is automatically created in interactive mode for convenience. |
What is the difference between persist() and cache() in Spark? | cache() uses the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames); persist() lets you choose the level. | cache() is shorthand for persist() with the default storage level, while persist() accepts any custom StorageLevel. |
How does Spark handle data shuffling, and why is it expensive? | Spark redistributes data across partitions, causing I/O, network, and memory overhead. | Shuffling involves disk and network operations which slow down performance and require more resources. |
What is a broadcast variable in Spark and when should it be used? | Used to cache a read-only variable on all nodes to avoid shipping with tasks. | Broadcast variables are efficient for small datasets that are reused across many tasks. |
Explain the difference between narrow and wide transformations with examples. | Narrow (e.g., map): data from one partition. Wide (e.g., groupByKey): requires shuffle. | Narrow transformations don't require shuffling. Wide ones do and are more expensive. |
What are the different storage levels in Spark? | MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, MEMORY_ONLY_SER, etc. | Storage levels define how RDDs are cached – in memory, disk, or serialized form. |
What is a DAG in Spark, and how is it used in job execution? | DAG is a Directed Acyclic Graph of stages representing computation lineage. | Spark builds a DAG of execution for transformations before running any action. |
What are accumulators in Spark and how are they different from broadcast variables? | Accumulators are write-only shared variables for aggregations; broadcast are read-only. | Accumulators are useful for debugging or counters; broadcast for small lookup data. |
How do DataFrame APIs differ from RDD APIs in Spark? | DataFrames are optimized using Catalyst and Tungsten; RDDs offer more control. | DataFrames are higher-level APIs with better performance; RDDs are more flexible but slower. |
What are some best practices for optimizing Spark jobs? | Use partitioning, caching, avoid shuffles, use broadcast joins, and monitor jobs. | Performance improves by reducing shuffles, tuning partitions, and reusing cached data. |
Explain checkpointing and why it is used in Spark streaming applications. | Checkpointing saves RDD lineage info to stable storage to recover from failures. | It helps prevent long lineage chains and supports recovery in streaming jobs. |
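To make the persist/cache and broadcast-variable answers above concrete, a short PySpark sketch; the data and lookup values are illustrative, and `spark` is an existing SparkSession:
from pyspark import StorageLevel

rdd = spark.sparkContext.parallelize(range(1000))
rdd.persist(StorageLevel.MEMORY_ONLY)        # explicit storage level for an RDD
# df.cache()  # for a DataFrame, cache() defaults to MEMORY_AND_DISK

codes = spark.sparkContext.broadcast({"US": "United States", "IN": "India"})
names = spark.sparkContext.parallelize(["US", "IN", "US"]).map(lambda c: codes.value[c])
print(names.collect())                       # read-only lookup on the executors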