quot; or they are blank, indicating that the player isn't signed for those years. The code in **B6** that will convert everything to numbers for us: ```python # excel:python B6 = cc1 = xl("current_contracts[#All]", headers=True) # Identifying columns containing annual salary information salary_columns = ["2023-24", "2024-25", "2025-26", "2026-27", "2027-28"] # Converting salary columns to numerical values by removing non-numeric characters for col in salary_columns: cc1[col] = cc1[col].replace('[\$,]', '', regex=True).astype(float) # Keeping only the player name and annual salary information columns cc2 = cc1[["Player"] + salary_columns] ``` I construct a list of all the salary columns that I can reuse in other cells and then use a for loop to iterate over them. For each column, I'm using the vectorized replace() function in Pandas to strip all "quot; characters from each value before converting the result to a floating point number. It's important to always use vectorized functions where possible in Pandas for both improved readability and faster execution. ## Sidebar: What's the calculation order of the Python cells? Python cells execute left-to-right, top-to-bottom starting from the left-most sheet in the workbook to the right-most sheet. I call this row-major order. Note that this is different from how Excel formulas are evaluated; Excel determines the optimal execution sequence for the cells by analyzing dependencies between the cells. In the Python case, this analysis is much more complex and it cannot be successful 100% of the time. Furthermore, Python global variables must be defined before use and Python globals are scoped to all Python cells in the workbook. Because of this, Excel uses a defined execution order. ## Enriching the player salary data with draft year and date The draft_data table contains information about the year that a player was drafted and their pick within the draft. I can use the information in this table to enrich the table that I will use to train the model. ```python # excel:python B7 = dd1 = xl("draft_data[#All]", headers=True) # Selecting the relevant columns for draft information[[AM3]](#_msocom_3) draft_info = dd1[["Player", "DraftYear", "Pk"]] # Renaming the columns to make them descriptive dd2 = draft_info.rename(columns={"DraftYear": "Draft Year", "Pk": "Draft Position"}) # Merging the salary table (filtered_data) with the draft information (draft_info) on the player names td1 = pd.merge(cc2, dd2, on="Player", how="left") ``` The Pandas merge() function is a simple way of merging two tables that have a common key, Player in this case. If you are an experienced Jupyter user, you may be wondering how Excel decides what value to return to the grid. The result of an assignment expression in Python is None, so typically you’d write td1 as the last line of code in the cell if that’s the object that you want returned to the grid. Excel uses Jupyter as the execution back end and it configures the TerminalInteractiveShell.ast_node_interactivity = last_expr_or_assign so that the value of the last assignment instead of the result is returned to the grid. ## Identifying elite NBA players The criteria that I'll use for identifying an elite player is calculating whether the annual average of their non-rookie contract falls in the top 20% of all players. The first thing that I need to do is calculate the years of service for a player. This will make it easier for us to determine rookie status: ```python # excel:python B8 = # Calculating the years of service by subtracting the draft year from the current year (2023) current_year = 2023 td2 = td1.copy() td2["Years of Service"] = current_year - td1["Draft Year"] td2 ``` You will notice that I am also using vectorized calculations in this cell. A single line of code computes the difference between the current year and the draft year column, and the subtraction operator implicitly performs the calculation over all rows in the DataFrame. Generally, the use of a for loop over the contents of a DataFrame tells you that you're probably doing something wrong. I can use the years of service column to categorize players as rookies or veterans. This is computed using the algorithm in **B10**. Here’s the first example of declaring a Python function in a cell. The categorize_player() function lets us encapsulate the categorization logic in a function which I can then apply over the dataset using the Pandas apply() function. ```python # excel:python B9 = # Function to categorize players based on the given criteria def categorize_player(row): global salary_columns years_of_service = row["Years of Service"] # Check for undrafted players if pd.isna(years_of_service): return 3 # Check for veterans if years_of_service > 4: return 2 # Check for rookies if years_of_service <= 2: return 0 # Check for rookies with contract extension (3 or 4 years of service) if 3 <= years_of_service <= 4: for i in range(len(salary_columns) - 1): current_salary = row[salary_columns[i]] next_salary = row[salary_columns[i + 1]] if pd.notna(current_salary) and pd.notna(next_salary) and next_salary > 2 * current_salary: return 1 return 0 # Apply the function to each row and create the new column td3 = td2.copy() td3["Player Category"] = td2.apply(categorize_player, axis=1) td3 ``` Notice that it even handles the case of rookies who have signed contract extension. To categorize players as elite or not, I want to use the extension value of the contract and not the rookie value of the contract. The new Player Category column lets me use different algorithms to calculate the average value of the non-rookie components of veteran players or rookies with contract extensions: ```python # excel:python B10 = # Function to calculate the average contract value based on the player # category def calculate_average_contract(row): global salary_columns salaries = row[salary_columns].dropna() # If the player is a rookie with a contract extension, find the # significant increase and calculate the average after that if row["Player Category"] == 1: for i in range(len(salary_columns) - 1): current_salary = row[salary_columns[i]] next_salary = row[salary_columns[i + 1]] if (pd.notna(current_salary) and pd.notna(next_salary) and next_salary > 2 * current_salary): return salaries[i + 1:].mean() # For other categories, simply calculate the average of all salaries return salaries.mean() # Apply the function to each row and create the "Average Contract Value" # column td4 = td3.copy() td4["Average Contract Value"] = td3.apply(calculate_average_contract, axis=1) td4 ``` Now that I have computed the Average Contract Value column, I can use it to compute values for the Elite Status column that indicates whether a player is considered elite: ```python # excel:python B11 = # Calculate the 80th percentile of the "Average Contract Value" to identify the threshold for elite players elite_threshold = td4["Average Contract Value"].quantile(0.8) # Create the "Elite Status" column based on whether the average contract value is above the elite threshold td5 = td4.copy() td5["Elite Status"] = (td4["Average Contract Value"] > elite_threshold).astype(int) td5 ``` This is what the td5 DataFrame looks like after this step: ![[Pasted image 20231015213646.png]] The final two steps are pretty routine: use the Player column in td5 to merge the cleaned college stats DataFrame in cp3 using the player_name column. ```python # excel:python B12 = td6 = pd.merge(td5, cp3, left_on="Player", right_on="player_name", how="left") ``` Finally, I need to remove any rows that have missing values in the player_name column, as those are players who are missing college stats, that is, they weren’t drafted from college. There are also some stats that didn’t exist in earlier years of college basketball, so I want to exclude columns that have missing values in them. ```python # excel:python B13 := # Remove any rows that have missing values in player_names column td7 = td6.dropna(subset=['player_name']) # Iterate through the columns and drop those with missing values, excluding salary_columns columns_to_remove = [] for col in td7.columns: if col not in salary_columns and td7[col].isna().any(): columns_to_remove.append(col) # Remove additional columns columns_to_remove += ['player_name', 'year'] td8 = td7.drop(columns=columns_to_remove) ``` The final DataFrame contains college stats for 315 current players in the NBA. # Excel Concepts This is a quick overview of Excel concepts that may be surprising or require a precise definition for Python programmers. ## Workbooks, Sheets, Cells, and Formulas An Excel workbook contains a set of sheets. Each sheet is a grid of cells. Cells are identified using letter(s) and number(s). The range of columns identified by letters is A-XFD (16384 columns) and the range of rows identified by numbers is 1-1048576 (1048576 rows). A cell can be blank, contain a value, and optionally a formula. If a cell contains a value, it can be one of the following types: text (**A1**), number (**A2**), boolean (**A3**), error (**A4**), rich data types (**A5**) an example of which is a stock price. ![[Pasted image 20231015213849.png]] If the cell contains a formula, the output of that formula is the value that is stored in that cell or the surrounding cells if the formula can spill values into the grid. In the example below, **B1** contains a formula to sort the values in **A1**:**A4**. The sorted values clearly don't fit in just a single cell, so Excel _spills_ the values into cells **B1**:**B4**. ```excel # excel B1 = SORT(A1:A4) ``` ![[Pasted image 20231015213924.png]] ## Excel Errors If the value computed by a cell is a runtime error, for example, a division by zero error, that value is stored in the cell. That value is also propagated to any cells that depend on that cell, which helps users diagnose errors by tracing values back to the originating cell. Consider the following formulas: ```excel # excel A1 = 1/0 # excel B2 = A1+1 # excel C3 = SUM(B1:B2) ``` For other formula errors like syntax errors, Excel will highlight the error in the cell and in some cases like on Excel for Windows or Mac will not allow the user to commit the formula. The original divide by zero error occurred in **A1** but gets propagated to **B2** and **C3**. You can also use the Excel [Trace Precedents](https://support.microsoft.com/en-us/office/display-the-relationships-between-formulas-and-cells-a59bef2b-3701-46bf-8ff1-d3518771d507) command to show how the value in **C3** depends on the values in **B1** and **B2**: ![[Pasted image 20231017092619.png]] ## Cell References TODO: move to appendix? Formulas [can reference other cells in the same workbook](https://support.microsoft.com/en-us/office/create-or-change-a-cell-reference-c7b8b95d-c594-4488-947e-c835903cebaa). An unqualified cell reference references another cell on the same worksheet, for example, **Q12** must reference cell **Q12** on the same sheet as the cell that contains the formula. If a cell reference needs to reference a cell on a different sheet, that cell reference must be qualified by the name of the other sheet, for example, **Sheet2!A1**. There are more complex quotation rules involved in forming a reference to a cell on another sheet. Cell references are shown to you using absolute references, that is, the letter/number format, for example, **A1** or **ZZ1000**. Internally, those cell references are stored as _relative references_, for example, “2 rows above and 5 columns to the left”. You can place [Excel in R1C1 mode](https://learn.microsoft.com/en-us/office/troubleshoot/excel/numeric-columns-and-rows) to see this for yourself. ![[Pasted image 20231017092640.png]] Below is a simple example: the first cell in row 1, column 1 contains the value 1. The second cell in row 2, column 1 contains a reference to the row above (-1) in the same column and adds 1 to it resulting in the value 2. The third cell in row 3, column 2 contains a reference to the row above (-1) and the column to the left (-1) and adds 1 to it resulting in the value 3. ```python # excel R1C1 = 1 # excel R2C1 = R[-1]C+1 # excel R3C2 = R[-1]C[-1]+1 ``` ![[Pasted image 20231017092715.png]] Since cell references are always stored as relative references, this explains how Excel cells can be easily copied and have their cell references _adjusted_. Consider this example: ```python # excel A1 = 42 # excel B1 = A1+1 # excel C1 = B1+1 ``` If I switch to R1C1 notation, you can see that the cell references in **B1** and **C1** are really: ```python # excel A1 = 42 # excel B1 = RC[-1]+1 # excel C1 = RC[-1]+1 ``` When I copy the formulas to **A2**-**C2**, _the formulas don’t change_ because of relative referencing: ```python # excel A2 = 42 # excel B2 = RC[-1]+1 # excel C2 = RC[-1]+1 ``` This is what it looks like in the grid with cell **B2** selected: ![[Pasted image 20231017092853.png]] Formulas can reference cells in other workbooks as well through [linked workbooks](https://support.microsoft.com/en-us/office/create-an-external-reference-link-to-a-cell-range-in-another-workbook-c98d1803-dd75-4668-ac6a-d7cca2a9b95f). The reference is created when both workbooks are opened within an Excel instance. Once the reference is created, Excel maintains a cache of those values so that it can continue to calc even if the other workbook is closed or unavailable. Formulas can reference other cells in workbooks open in the same Excel instance, and can even reference cells in a closed workbook when you create an [external reference](https://support.microsoft.com/en-us/office/create-an-external-reference-link-to-a-cell-range-in-another-workbook-c98d1803-dd75-4668-ac6a-d7cca2a9b95f). ## Excel Functions Formulas can contain functions. Excel has hundreds of [built-in functions](https://support.microsoft.com/en-us/office/excel-functions-alphabetical-b3944572-255d-4efb-bb96-c6d90033e188) and users can write their own user-defined functions using languages like Excel Formula Language ([LAMBDA functions](https://support.microsoft.com/en-us/office/lambda-function-bd212d27-1cd1-4321-a34a-ccbf254b8b67)), Visual Basic for Applications (VBA), JavaScript, or by using a native language like C++ when using Excel for Windows. Almost all Excel functions are _pure_, that is, the value of the output is solely determined by the values of the inputs. If you call the function multiple times with the same inputs, you are guaranteed to get the same outputs. For example, the [SUM() function](https://support.microsoft.com/en-us/office/sum-function-043e1c7d-7726-4e80-8f32-07b23e057f89#:~:text=The%20SUM%20function%20adds%20values,values%20in%20cells%20A2%3A10.) computes the sum of the ranges passed in as input parameters and returns the sum of those ranges as its value. There are functions which, by design, return different values each time that you call them. Some examples of these are the [TODAY()](https://support.microsoft.com/en-us/office/today-function-5eb3078d-a82c-4736-8930-2f51a028fdd9) function and the [RAND()](https://support.microsoft.com/en-us/office/rand-function-4cbfa695-8869-4788-8d90-021ea9f5be73) function (for a complete list see [the documentation](https://learn.microsoft.com/en-us/office/client-developer/excel/excel-recalculation#volatile-and-non-volatile-functions)). These functions are marked as [_volatile_ functions](https://learn.microsoft.com/en-us/office/dev/add-ins/excel/custom-functions-volatile). An important implication of volatile functions is that they are recalculated every time that Excel recalculates. This is extremely useful in certain scenarios such as Monte Carlo simulations. However, this flexibility comes at a significant computational cost so be careful how much you use these functions. If you are authoring user defined functions it is your responsibility to mark functions as being volatile as only you know whether your function returns different values when given the same inputs. ## Recalculation modes: Automatic, Partial and Manual By default, Excel performs _automatic recalc_. When you change a value of a cell, Excel will automatically recalculate the values of any dependent cells. If you change the value of A1 from 1 to 2, Excel will recalculate the values of B1 and C1 because they are dependent on the value of A1. The values of B1 and C1 before recalc are 2 and 2 and after recalc are 3 and 4. ```python # excel A1 = 1 # excel B1 = A1+1 # excel C1 = A1*2 ``` Excel automatic recalc can detect and work with circular references in cells as well. For more details, see the [documentation](https://support.microsoft.com/en-us/office/remove-or-allow-a-circular-reference-8540bd0f-6e97-4483-bcf7-1b49cd50d123). Sometimes you want cells that contain expensive operations to be manually recalculated, but have other cells automatically recalculated. In this case you can specify "Partial" recalculation mode which lets you exclude [Data Tables](https://support.microsoft.com/en-us/office/calculate-multiple-results-by-using-a-data-table-e95e2487-6ca6-4413-ad12-77542a5ea50b) and Python cells from automatic recalculation. You can find this option under the Formulas tab in the ribbon in the Calculation section: ![[Pasted image 20231017093002.png]] If you want maximum control, Excel can be placed in manual recalculation mode. In this mode changing the value of a cell doesn't trigger recalculation of dependent cells. Instead, you must tell Excel when you want it to calculate new values by clicking on the **Calculate Now** button in the ribbon or pressing the **F9** key. ## Stale Cells Excel has a new mode which shows _stale_ cells_,_ which are cells that Excel knows need to be recalculated because a dependency changed in the sheet. You can enable this mode by toggling the Format Stale Values mode under Calculation Options under the Formulas tab in the ribbon: ![[Pasted image 20231017093019.png]] These cells are shown with a strikeout through the cell value. Below you can see **A2** adds 1 to the value of **A1**. If I have the spreadsheet in manual recalculation mode and I change the value of **A1** from 100 to 99, you can see that **A2** is now crossed out. When **A2** is selected, a yellow triangle icon appears to the right that warns you that the cell value is stale. ![[Pasted image 20231017093035.png]] This also works with Python cells. ```python # excel A3 := xl("A1")+2 ``` Below you can see what happens if I edit **A1** and change the value to 1: because both **A2** and **A3** depend on the value of **A1** and Excel is in manual recalculation mode, the cell values in **A2** and **A3** are both rendered with a strikethrough. ![[Pasted image 20231017093107.png]] You may want to use this mode with Python if you have long running Python functions to minimize the amount of time that is spent recalculating the grid when you change a value or a formula. ## Cell Formatting Cells can also have formatting applied to them as well. One notable example is dates, which are stored as numbers but _formatted as dates_. ![[Pasted image 20231017093137.png]] **A1** and **C1** are formatted as General and **B2** is formatted as a Long Date. You can perform arithmetic on dates as well. The whole number portion, 12345, represents the number of days since Excel epoch, Jan 1, 1900. The fractional number portion, 0.6789, represents fractional portion of the day. # The Excel Type System This section discusses the Excel type system and how Python types integrate with Excel types. ## Excel Values and Python Objects Python in Excel integrates the Python type system with Excel's. It defaults to returning Python values as Excel objects (I also call this return by reference in this document), but you can switch to returning as Excel values using this drop down: ![[Pasted image 20231017093154.png]] You can also use the keyboard shortcut **Ctrl+Alt+Shift+M** to toggle between these two output modes. Let's consider the case of a Python number. **A1** returns 3 as an Excel object (you can easily tell by the icon on the left-hand side of the cell). **A2** returns 3 as an Excel value. **A3** computes the sum of **A1** and **A2** where Excel objects and Excel values are both treated as Excel numbers. ```python # excel:python A1 = 3 # excel:python A2 := 3 # excel A3 = A1+A2 ``` ![[Pasted image 20231017093237.png]] This works because Excel treats _single value objects_ as if they were values during computation. It does so by storing _both_ the object reference and the value in the cell. Excel also renders the single value of the object in the cell. This lets you use the default of returning values from Python as Excel objects without needing to tediously switch from returning objects to values. Other examples include Python float, Boolean and str: ```python # excel:python A1 = 42.0 # excel:python A2 = "Hello, World" # excel:python A3 = True ``` ![[Pasted image 20231017093311.png]] In cases where a Python object has multiple values, Excel will show the object type as the value in the cell. You can inspect the object's values by clicking on the icon in the cell: ```python # excel:python B1 = xl("A1:A3") ``` ![[Pasted image 20231017093340.png]] ## Python Values are Excel Values Values returned to the Excel grid _are_ Excel values and are indistinguishable from values produced by Excel formulas or values manually entered into a cell. In the following example, the Excel functions ISNUMBER() ISTEXT() ISLOGICAL() all return TRUE, regardless of whether the values that they are testing came from Excel or from Python and regardless of whether they were returned by value or by reference. ```python # excel:python A1 = 42 # excel:python A2 := 42 # excel:python A3 = "Hello, World" # excel:python A4 := "Hello, World" # excel:python A5 = True # excel:python A6 = True # excel B1 = ISNUMBER(A1) # excel B2 = ISNUMBER(A2) # excel B3 = ISTEXT(A3) # excel B4 = ISTEXT(A4) # excel B5 = ISLOGICAL(A5) # excel B6 = ISLOGICAL(A6) ``` ![[Pasted image 20231017093437.png]] Values produced by Python can participate in the Excel _calculation chain_. ```python # excel:python A1 := 3+4 # excel:python A2 := 5+6 # excel A3 = SUM(A1:A2) ``` ![[Pasted image 20231017094954.png]] When I change values in a Python formula, for example, changing 4 to 5 in **A1**, Excel will automatically recalculate all dependent cells. The value in **A3** changes from 16 to 17. This is the same behavior as if you had computed those values using Excel formulas instead of using Python. ## Python Numbers Let's take a closer look at how Python in Excel handles numbers. **A1** computes a value using the Excel formula language. ```python # excel A1 = 3+4 ``` **B1** is a Python cell that computes the same value. ```python # excel:python B1 = 3+4 ``` Numbers with a decimal point in Python are floating point numbers. **C1** is a floating point number that is returned to Excel: ```python # excel:python C1 = 3. ``` Both Python and Excel represent floating point numbers using [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) representation. If the result of a Python cell is a float, Excel will receive a bit-for-bit identical copy of the IEEE 754 floating point number computed by Python. Numbers without a decimal point in Python are integers by default. Excel, on the other hand, does not have the concept of an integer; all numbers are floating point numbers. Excel will convert Python integers into Excel floating point numbers when returning the result of integer calculations to the grid. This conversion is lossy, however, as some integers cannot be represented exactly in floating point. TODO: appendix? The size of Python integers is only limited by memory, which means that a Python integer converted to a float can overflow the limits of numbers that can be represented by an IEEE 754 float. If the result of the conversion is greater than the maximum value of IEEE 754 floating point numbers, ~1.797x10308, a #NUM! error is returned as the value. If the number is less than the minimum value of IEEE 754, ~2.225x10-308, the value is rounded to zero. ```python # excel:python A2 := 1.798e308 # excel:python A3 := 1.797e308 # excel:python A4 = 1.797e308 # excel:python B2 := 2.225e-308 # excel:python B3 := 2.226e-308 # excel:python B4 = 2.226e-308 ``` ![[Pasted image 20231017095147.png]] Python [decimal](https://docs.python.org/3/library/decimal.html) numbers, which provide exact representations for decimal numbers, are converted to floats using the same rules as Integer numbers. Python [complex](https://docs.python.org/3/library/cmath.html) numbers are not automatically converted by Excel and will generate a result of #N/A instead. ```python # excel:python A1 := 1+2j ``` ![[Pasted image 20231017095519.png]] Python gives you access to the real and imaginary components of the complex number via the real and imag fields of the object. You could choose to return those values by spilling horizontally onto the grid by creating a list of lists: ```python # excel:python A2 := c=1+2j;[[c.real, c.imag]] ``` ![[Pasted image 20231017095553.png]] In Python there are arithmetic operations that can result in not-a-number (NaN) values. A simple example is subtracting infinity from infinity: ```python # excel:python A1 := float("inf")-float("inf") # excel:python A2 = float("inf")-float("inf") ``` ![[Pasted image 20231017095609.png]] In Python the result of that expression is a NaN, and if I inspect the value of the Excel object in **A2** you can see the nan result in the card view. In **A1**, I ask for the result as an Excel value and you can see that Python in Excel converts the NaN into a #NUM! error. ## Python Strings Python cells can return Unicode strings. C1 will return "Hello, World!" to Excel. ```python # excel:python C1 := "Hello, " + "World!" ``` TODO: appendix? TODO: Perhaps another important edge case to call out is that ‘\n’ characters in strings are erased when displayed, and there is no way to display multi-line text from a Python cell. Excel has a limit of 32767 characters in a cell[[Gv1]](#_msocom_1) , but Python strings can be much longer than this. Strings beyond the maximum length result in a #VALUE! error as you can see in **A2**. If you ask Excel to store the result as an Excel object, the value of the string is stored in the cell in addition to the handle to the Python object. In **A3** you can see the value of the string because it fits within the maximum length limit. In **A4** you can see that it shows you the type because the string is too large to be stored in the cell. ```python # excel:python A1 := "x"*32767 # excel:python A2 := "x"*32768 # excel:python A3 = "x"*32767 # excel:python A4 = "x"*32768 ``` ![[Pasted image 20231017095853.png]] Just like we've seen with numbers, you can use Python strings returned as Excel objects directly in Excel functions without needing to convert. **A3** succeeds because the total length of the strings in **A1** and **A2** is 32767. **A6** fails with a #CALC! error because the total length of the strings in **A4** and **A5** is 32768, which is larger the maximum string length in Excel. ```python # excel:python A1 = "x"*16383 # excel:python A2 = "x"*16384 # excel A3 = CONCAT(A1,A2) # excel:python A4 = "x"*16384 # excel:python A5 = "x"*16384 # excel A6 = CONCAT(A4,A5) ``` ![[Pasted image 20231017095932.png]] ## Python Dates TODO: Key information is (a) Excel doesn’t know about date/time, they are represented as floats and the date/time rendering is due to formatting options; (b) Python datetime object values are sent to Excel as numbers, but you have to apply the formatting to render them as such. And maybe (c) Excel doesn’t know timezones, and the user’s timezone is used. (The details about what happens when the user changes timezones mid-session seems too esoteric to go into.) TODO: Maybe parenthesize the bit about the 1904 date system? According to the link it seems to exist for compatibility with pre-2011 Mac Excel only. TODO: Maybe give a clue that 12345 corresponds to 10/18/1933? The “# excel:python” notation shows its weakness here, I skipped the paragraph of text and tried to understand how the code produced the output shown – which cannot be understood without knowing that you formatted the first and 3rd cell as short date but not the 2nd. Excel does not have a concept of a datetime. Instead cells that display dates do so because they have [date-specific formatting applied to the cell](https://support.microsoft.com/en-us/office/format-a-date-the-way-you-want-8e10019e-d5d8-47a1-ba95-db95123d273e#:~:text=what%20you%20want.-,Select%20the%20cells%20you%20want%20to%20format.,format%20you%20want%20in%20Type.). The value of the cell is a serial floating point number. The portion of the number before the decimal point is the number of days since 1900 or 1904, [depending on Excel settings](https://support.microsoft.com/en-us/office/date-systems-in-excel-e7fe7167-48a9-4b96-bb53-5612a800b487). The portion of the number after the decimal point is the fractional part of a day, for example, 12:00PM is represented as 0.5 because it is half of a day. Consider the following examples. **A1** contains the number 12345.6789 and the cell is formatted as a short date. **A2** uses the Python datetime library to marshal a datetime object to Excel as an Excel object. The value that is stored in the cell is a number: 12345 and the cell I formatted as a number. **A3** creates the same object but the cell is formatted as a **short date**. Inspecting the card shows that the underlying handle to the Python object is a datetime object. ```python # excel A1 = 12345.6789 # excel:python A2 = import datetime as dt dt.datetime(1933,10,18) # excel:python A3 = dt.datetime(1933,10,18) ``` ![[Pasted image 20231017102153.png]] In the example below **A1** and **A2** contain the same number, but are formatted as a long date and time respectively: ```python # excel A1 = 12345.6789 # excel A2 = 12345.6789 ``` ![[Pasted image 20231017102225.png]] Python cells that return datetime objects are automatically converted into Excel serial dates, respecting the global setting for 1900 or 1904 date systems. ```python # excel:python A1 = day = dt.datetime(1933,10,18) hours = dt.time(12,0) dt.datetime.combine(day,hours) # excel:python A2 = xl("A1").hour # excel:python A3 = xl("A1").minute ``` **A1** is rendered as **Time** and **A2** and **A3** are rendered as **General**. Since **A1** contains a reference to the Python datetime object, you can reference attributes of that object from other cells, as you can see in **A2** and **A3**. ![[Pasted image 20231017102310.png]] A datetime object can be manipulated by Excel formulas because Excel marshals datetime objects to the grid as numbers. Below, **A2** adds one day to the datetime object computed by **A1**. ```python # excel:python A1 = dt.datetime(1933, 10, 18) # excel A2 = A1+1 ``` ![[Pasted image 20231017102347.png]] If a cell contains a formatted datetime value and it is referenced by a Python cell, it is automatically converted to a datetime object. In **A2** I add a day to the datetime object that is returned by the call to the xl() function. ```python # excel A1 = DATE(1933, 10, 18) # excel:python A2 = xl("A1") + datetime.timedelta(days=1) ``` Both cells are formatted as **Short Date**, and you can see the next day in **A2**. ![[Pasted image 20231017102422.png]] Excel does not understand time zones. It is assumed that any datetime values in cells are in the user's current time zone as determined by the operating system settings on the computer that they are using. When the Python runtime is initialized, typically the first time a Python cell needs to be executed, it is initialized to the same time zone as the user. If the user changes time zones while the Python runtime is running, the time zone in the Python runtime is not automatically updated. # Python integration with Excel Python cells can contain multiple expressions. Only the value of the last expression evaluated in a cell will be returned as the value of the cell. The value of **A1** in this example is 35, the result of evaluating the last expression in the cell, 5*7: ```python # excel:python A1 = 3+4 5*7 ``` ![[Pasted image 20231017102504.png]] Cells can contain more than just Python expressions; they can also contain Python statements, import statements, function definitions and even class definitions (which technically are just statements). Any legal Python code can be included in a cell. Here's an example of importing some commonly used libraries. In Python the result of an import statement is None, and you see NoneType in **A1**. The reason why you see NoneType instead of None is that Excel will show a preview of the value only if there is an exact representation of that value in Excel. Because None cannot be converted to an Excel value, the _type_ of None which is NoneType is shown instead. This is consistent with how Excel behaves with types like DataFrame. ```python # excel:python A1 = import numpy as np import pandas as pd import pandas.pyplot as plt ``` ![[Pasted image 20231017102543.png]] Python _statements_ typically return None as their result. For example, the result of a for loop is None which may not be intuitive, which Excel displays as NoneType in the grid or None if we return as an Excel value: ```python # excel:python A1 = x=0 for i in range(10): x = x + i ``` ![[Pasted image 20231017102624.png]] To return the computation result to the grid you will need to add x to the end of the code block which returns the value of the x as the value of **A1**: ```python # excel:python A1 = x=0 for i in range(10): x = x + i x ``` ![[Pasted image 20231017102659.png]] You may be tempted to _return values to the cell_ using the return statement. So instead of writing: ```python # excel:python A1 = return 42 ``` You should instead write: ```python # excel:python A1 = 42 ``` Since your code executes at global scope, any variables that you define within a cell are global variables, which means they can be referenced from any other cell in Python. The variable x is defined in **A1**: ```python # excel:python A1 = x=1 ``` Can be referenced in **B1** which produces the value 2: ```python # excel:python B1 = x+1 ``` Python cells execute in a well-defined order: left to right, top to bottom. This gives you a simple mental model when writing your Python code in Excel and a way for you to ensure that variables are defined before you use them. In this example **A1** executes before **B1**, which is how you can guarantee that x is defined before use. You can define functions in cells as well: **A1** defines a function that squares a number: ```python # excel:python A1 = def square(x): return x**2 ``` When you reference it from cell **B1**, the value of that cell is 16: ```python # excel:python B1 = square(4) ``` It's important to recognize that any variable that you define in a cell is a _global variable_. This makes it convenient for you to define x in one cell and use it again in another cell as you saw earlier in this section. However, it is also easy for you to redefine variables in different cells accidentally. Here's an example where x is used in **A1** to hold an important result to be used later in **C1**, but is also used as a throwaway value in a loop in **B1**. Because the order of execution is from left to right, the cells are executed in the order that you see below and may not produce the result that you would have expected because **B1** is executed between **A1** and **C1**: ```python # excel:python A1 = x=compute_important_value() # excel:python B1 = for x in range(10): do_something() # excel:python C1 = compute_more_important_result_using_x(x) ``` ## Referencing cells using xl() Cells in the grid are referenced using the xl() function. This function accepts a string as a parameter that contains an Excel [cell reference](https://support.microsoft.com/en-us/office/create-or-change-a-cell-reference-c7b8b95d-c594-4488-947e-c835903cebaa) or [structured reference](https://support.microsoft.com/en-us/office/using-structured-references-with-excel-tables-f5ed2452-2337-4f71-bed3-c8ae6d2b276e). Below, **A1** contains a reference to **B1** which adds 1 to the value of **B1** and stores the result as the value of **A1**: ```python # excel:python A1 = xl("B1")+1 ``` ![[Pasted image 20231022185522.png]] The values in the grid are sent to the Python interpreter for execution along with the code in the cell. This is a _push_ model where the values are sent to Python; it is not possible to retrieve values from the grid programmatically from Python. The cell reference must be a string literal; it cannot be stored in a variable. Below are examples of valid cell references: ```python xl("A1") xl("A1:A9") xl("A1:B22") xl("Table1") xl("Table1#All") xl("Table1#Data") xl("Table1#Headers") xl("Sheet2!A1") ``` Below are examples of invalid cell references: ```python xl("A"+"1") ref="A1"; xl(ref) ``` Cell references in Excel can be fixed up when copying cells within a workbook. Recall that I discussed earlier that cell references in Excel are _relative references_ within a workbook. This means that cell references in xl() calls will be "fixed up" when that code is copied or moved within a workbook. The xl() function can reference more than a single cell. A 1D range of cells can be referenced as well, for example, xl("A1:C1") or xl("A1:A3"). ```python # excel:python D1 := type(xl("A1:C1")) ``` ![[Pasted image 20231022190206.png]] ```python # excel:python B1 := type(xl("A1:A3")) ``` ![[Pasted image 20231022190350.png]] In both cases, the xl() function returns a Pandas DataFrame object. You can pass the DataFrame to any function that can perform computations on it. For example, **B1** uses numpy average() function to compute the average of the range **A1:A5** in the sheet: ```python # excel:python B1 == np.average(xl("A1:A5")) ``` ![[Pasted image 20231022190416.png]] The xl() function can also reference 2D ranges. Consider the sheet below and the code in cell **E1**: ```python # excel:python E1 = df = xl("A1:C4", headers=True) np.average(df["b"]) ``` ![[Pasted image 20231022190621.png]] In the above example, df is assigned the Pandas DataFrame returned by the call to xl(). Note that the headers=True parameter uses the first row in the range as the column headings for the DataFrame. This lets me reference the column with the heading "b" in the call to the numpy average() function that compute the average of the values in column b. ## Returning multiple values to the grid Python can return multiple values to a cell. **A1** returns a Python tuple by value to the grid: ```python # excel:python A1 := (1,2,3) ``` ![[Pasted image 20231022190657.png]] Notice that the previous example used the := operator to instruct Excel to _spill_ the contents vertically in the grid. Standard Excel spilling behavior is in effect here. For example, if there were a value that would block spilling the result to the grid, a #SPILL! error would appear as the value of **A1** in place of the spilled values. ![[Pasted image 20231022190712.png]] A Python list will return spilled values as well: ```python # excel:python A1 := ["a","b","c"] ``` ![[Pasted image 20231022190816.png]] If you want to return multiple _computed_ values from a cell, you should collect them in a list or a tuple depending on your needs, and reference the list or tuple using the last expression in the cell: ```python # excel:python A1 := result=[] for i in range(5): result.append(i) result ``` ![[Pasted image 20231022191024.png]] A Python list of lists returns a two-dimensional grid of data: ```python # excel:python A1 := [[1,2],[3,4],[5,6]] ``` ![[Pasted image 20231022191112.png]] If you want to spill a list of values _horizontally_ you can do so by returning a list of lists that contains only a single row: ```python # excel:python A1 := [[1,2,3]] ``` ![[Pasted image 20231022191819.png]] ## Error Handling in Excel Excel propagates errors from their origin to all downstream cells that directly or indirectly depend on the origin cell. In the example below, **A1** contains a #DIV/0! error and because **A2** references **A1**, the error propagates to **A2** as well. ```excel # excel A1 = 1/0 # excel:B1 = A1+10 ``` ![[Pasted image 20231022192345.png]] You can use the IFERROR function to handle errors in Excel and prevent them from propagating to other cells by returning a specific value in lieu of the propagated error: ```excel # excel A3 = IFERROR(A2, "Error!") # excel A4 = CONCAT(A3, " But Still Calcs") ``` ![[Pasted image 20231022192433.png]] ## Error Handling in Python Unlike Excel, Python code can raise exceptions. When thinking about error handling, it's useful to think about the _direction of the error_: is it an error that originates in Python and propagates to the grid? Or is it an error that originates in the grid and propagates to Python? Let's consider a simple case in Python where there is a ZeroDivisionError in **A1**. When you run the code in the cell, an Exception fires and the Diagnostics pane automatically appears to the right, showing the Python exception type and its message. This pane contains errors caught during cell execution and a time stamp that you can use to see when the exception occurred. It also contains the output of print() statements. The pane adds newest errors at the top, pushing older errors down. If you want to clear the error messages, just click on the Clear diagnostics button on the top-right. ```python # excel:python A1 = 1/0 ``` ![[Pasted image 20231022192508.png]] You get the same behavior regardless of whether the result of the cell was returned to the grid as an Excel object or an Excel value. Here's a sheet with errors in **A1** and **A2**. The exception type and its message are printed along with the name of the sheet and the cell that the error occurred in. **A3** contains some code that writes to stdout and stderr. stderr messages are colored yellow in the diagnostic pane and stdout messages are colored white. ```python # excel:python A1 = 1/0 # excel:python A2 = assert 1==0, "Failing deliberately" # excel:python A3 = import sys print("Warning!", file=sys.stderr) print("Hello, World!") ``` ![[Pasted image 20231022192613.png]] You can use the Excel formula language ERROR.TYPE() function to detect whether a cell contains a Python error or not. #PYTHON! errors have the error value 19. The absence of an error is #N/A. ```python # excel:python A1 := 1/0 # excel:python A2 := assert 1==0, "Failing deliberately" # excel:python A3 := 1+1 ``` ```excel # excel B1 = ERROR.TYPE(A1) # excel B2 = ERROR.TYPE(A2) # excel B3 = ERROR.TYPE(A3) ``` ![[Pasted image 20231022192742.png]] ## Referencing a single cell that contains an Excel error in the grid Python cells can reference cells in the grid that contain Excel errors. Below are some examples of Excel formula language cells that contain errors in column A and Python cells that reference them in column B. Each row contains an independent example; let's examine this example one row at a time: Python in Excel automatically converts Excel errors to Python exceptions. The xl() function in **B1** references **A1** which raises a #DIV/0! error. Because the call to xl() happens inside of a Python exception block, the Python exception that represents the Excel error, excel.excelerror.Div0Error, is caught in the handler and outputs the message #DIV/0! to the Diagnostics pane. Because the exception was caught, the value of the last expression evaluated is None and that value is returned to the grid. ```python # excel:python A1 = 1/0 # excel:python B1 = try: xl("A1") except Exception as e: print(f"Exception type {type(e)}: {e}") ``` In this example **A2** raises a #NUM! error because the result of the calculation exceeds the largest floating point number that can be represented in Excel. The Python cell in **A2** references this cell but doesn't try to catch the exception, which would have been of type excel.excelerror.NumError. In this case, the unhandled Python exception propagates back to Excel and is converted back to the original #NUM! error. ```python # excel:python A2 = 1E+300*1E+100 # excel:python B2 = xl("A2") ``` ![[Pasted image 20231022192900.png]] ## Reference a range of cells that contain Excel errors The xl() function can reference more than a single cell. When it references a range of cells, that can contain errors, it will convert those incoming missing values to NaN values in the grid. ![[Pasted image 20231022193421.png]] Missing values are commonplace in many libraries. For example, Pandas represents missing values using the singleton pd.NA and numpy represents missing values using np.nan. # How Python Code is Executed ## Security Python code within the workbook runs in a hypervisor isolated container on Azure. This setup ensures two levels of isolation: first, the hypervisor boundary isolates the container from other containers running on the same physical machine; second, a network boundary isolates the container from the user's own computer. Two independent computers are involved in the experience: one runs the Excel client and the other executes the Python code. An Azure container is activated to run Python code whenever needed by the workbook. Ideally, this container remains active for the entire user session, until the workbook is closed. However, situations like unplanned interruptions or scheduled meetings can lead users to leave the container idle. To optimize the use of shared computing resources, the service automatically terminates the idle container after a 30-minute timeout, freeing up resources for other users. Excel sends both the code and the data that it operates on from the workbook to the container for execution. This is a "push model" where Excel is responsible for sending data to the container. Code executing in the container cannot request or "pull" data from the workbook. When Excel runs code within a container, there's a possibility that the code may not finish executing, such as in cases involving infinite loops. To manage this, Excel sets a default timeout of 30 seconds to automatically terminate such long-running code. However, there may be scenarios where this default timeout isn't adequate. You can adjust the timeout duration through Excel Options: ![[Pasted image 20231022193728.png]] Your Python code runs in a [**container** **image**](https://github.com/opencontainers/image-spec/blob/main/spec.md) provided by Anaconda and licensed by Microsoft. It contains ~400 preinstalled libraries that you can use simply by importing the library into Python. The container image was prepared by Anaconda as described by their [Security Practices document](https://docs.anaconda.com/free/anaconda/reference/security/). By default, Excel imports some packages for you in a script called init.py. You can see view (but not yet edit during Preview) this script from the Excel ribbon: ![[Pasted image 20231022193750.png]] ![[Pasted image 20231022193758.png]] The imports that are made for you in init.py means that you won’t need to import any of the libraries in Python cells that you write. In the future, you will be able to place all imports into this script once Excel enables editing. If you do need to import any additional libraries that you need to use. A useful tip during preview: remember that cells execute left-to-right, top-to-bottom within a sheet and sheets are executed from left to right within a Workbook. During preview, you can write all the code that imports libraries that you need in cell **A1** of the first sheet in your workbook. You can guard that code using a global flag that gets set to avoid re-initialization. TODO: code sample Each workbook carries with it an identifier for the Python Runtime Environment (PRE) that it was created in. Today the PRE is mapped only to the version of the Anaconda Python distribution that it was authored against. This ensures that when new versions of the Anaconda Python distribution are added to the service[[i]](#_edn1) the Python code running in your workbook will continue to execute using the PRE that you used when you created it. In the future, other attributes may be considered in identifying the PRE. When you share your workbook with another user and they open it, it starts in a disconnected state, that is, the container is not activated yet. When the other user performs an action in the workbook that triggers a recalc that involves a Python formula, the service launches a container with a matching PRE specified and recalc will begin. ## Performance Considerations Every time a Python cell is executed, Excel sends a message to the Python runtime hosted in Azure. That message is a JSON message that sends the code to be executed and any data that is needed by that code; this is the data referenced by xl() calls in the code. That message is received by a Jupyter kernel and executed using the Python interpreter, and the result is returned to the client either by value or by reference depending on what type of result the cell asked for. Every time a cell is executed, a new message containing data and code is sent. There is a time cost to this round trip that needs to be paid for each execution. The time taken to transmit code and data to the server leads to a general principle: **favor chunky over chatty communications**. In chunky communications you send batches of data to be calculated vs. in chatty communications where you send the data individually. For workbooks that contain a small number of Python formulas, this will likely not be noticeable. But for workbooks that contain many Python formulas, the transit time for the messages will likely dominate the amount of time needed to compute the results. Consider the common Excel pattern where you “drag down” a formula that applies to each cell in a row. Imagine that there are 1000 rows of numbers in column **A**. Column **B** contains 1000 Python formulas, all of the form: ```python # excel:python B1 = xl("A1")+1 # excel:python B2 = xl("A2")+1 ... ``` In this case, computing the values of **B1-B1000** will require 1000 calls to the Python runtime in Azure. A much better way of computing the answer is to make a single round-trip call to the Python runtime: ```python # excel:python B1 = xl("A1:A1000")+1 ``` This will pass the values of cells **A1-A1000** to the Python runtime, construct a Pandas DataFrame to hold the values, and perform a vectorized addition on the DataFrame. The results are then returned by value to the grid and spilled into cells **B1-B1000**. # Using External Data Python code running in the container cannot make outbound network requests to the Internet because requests are blocked by the network configuration on the container host. You will not be able to use Python to download data or call APIs on the Internet. However, any data that is in your workbook can be made available to Python. If you need external data, you can connect to them using the Get Data feature in Excel which downloads data into the workbook. This provides a secure and compliant way to access corporate data using existing mechanisms that have already been deployed by your organization. ![[Pasted image 20231022194327.png]] You can also download CSV files and load them as additional sheets in the workbook. Once there, you can access that data from Python. Python code cannot request data from the workbook associated with it; there is no way for malicious code that runs in your Python environment to access any data that wasn't explicitly passed to it by Excel. Cell references are immutable string constants and cannot be changed at runtime. This limits exposure to other data in the workbook that the author may not have intended the Python code to gain access to. # Appendix ## R1C1 Notation Cell references are shown to you using absolute references, that is, the letter/number format, for example, **A1** or **ZZ1000**. Internally, those cell references are stored as _relative references_, for example, “2 rows above and 5 columns to the left”. You can place [Excel in R1C1 mode](https://learn.microsoft.com/en-us/office/troubleshoot/excel/numeric-columns-and-rows) to see this for yourself. ![[Pasted image 20231022194352.png]] Below is a simple example: the first cell in row 1, column 1 contains the value 1. The second cell in row 2, column 1 contains a reference to the row above (-1) in the same column and adds 1 to it resulting in the value 2. The third cell in row 3, column 2 contains a reference to the row above (-1) and the column to the left (-1) and adds 1 to it resulting in the value 3. ```excel # excel R1C1 = 1 # excel R2C1 = R[-1]C+1 # excel R3C2 = R[-1]C[-1]+1 ``` ![[Pasted image 20231022194425.png]] Since cell references are always stored as relative references, this explains how Excel cells can be easily copied and have their cell references _adjusted_. Consider this example: ```excel # excel A1 = 42 # excel B1 = A1+1 # excel C1 = B1+1 ``` If I switch to R1C1 notation, you can see that the cell references in **B1** and **C1** are really: ```excel # excel A1 = 42 # excel B1 = RC[-1]+1 # excel C1 = RC[-1]+1 ``` When I copy the formulas to **A2**-**C2**, _the formulas don’t change_ because of relative referencing: ```excel # excel A2 = 42 # excel B2 = RC[-1]+1 # excel C2 = RC[-1]+1 ``` This is what it looks like in the grid with cell **B2** selected: ![[Pasted image 20231022194553.png]] We do something similar for Python formulas: If I copy and paste **A2** into **A3**, Excel will adjust the reference in **A3** to xl("A2"): ![[Pasted image 20231022194607.png]] If you switch the view of Excel to **R1C1 notation** using Excel Options: ![[Pasted image 20231022194620.png]] You'll see that Excel will adjust how the string literal is displayed to reflect the **R1C1** **notation**: ![[Pasted image 20231022194636.png]] ## Listing Python Libraries in the Python Runtime Environment Here’s some Python code that you can run in an Excel workbook to inspect the Python version and the libraries available in the Python Runtime Environment: ```python # excel:python A1 = from platform import python_version python_version() # excel:python A2 = import os with open('/etc/os-release') as f: for line in f: if line.startswith('NAME='): dist_name = line.strip().split('=')[1].strip('"') break dist_name # excel:python A3 = import os os.uname().release # excel:python A4 = import os mem_bytes = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES') mem_gib = mem_bytes/(1024.**3) # excel:python A5 = import os stat = os.statvfs('/') stat.f_frsize * stat.f_blocks / 2**30 # excel:python A6 = import os cpu_cores = os.cpu_count() with open('/proc/cpuinfo', 'r') as f: cpu_threads = f.read().count('processor\t:') f"CPU Cores: {cpu_cores}, CPU Threads: {cpu_threads}" # excel:python A7 = import subprocess import json result = subprocess.run(["conda", "list", "--json"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True) raw_data = json.loads(result.stdout) pd.DataFrame(raw_data) ``` ![[Pasted image 20231022194809.png]] **A1** outputs the Python version. In Preview Excel is running Python 3.9.16. **A2** retrieves the name of the Linux distribution by reading the /etc/os-release file in the container. **A3** calls os.uname() to get the name and version of the Linux kernel that the container is running on. **A4** and **A5** retrieve the amount of physical RAM and the system disk size in GB. **A6** returns the number of CPU cores and CPU threads. **A7** retrieves the list of installed conda packages in the container by executing the command conda list –json using the Python subprocess package, loads the JSON output into a Pandas DataFrame and returns the values from it by value in the grid. From the results, you can see that Python executes in an x64 architecture container with 1 CPU Core, 1 CPU Thread, 2GB RAM and a 49GB system disk. The container guest operating system is the [Microsoft CBL-Mariner Linux distribution](https://github.com/microsoft/CBL-Mariner). # Acknowledgements Steve Dower, Austin Henley, Joe McDaid, Alison McKay, Keyur Patel, Guido van Rossum, Isaac Wilson, and Jack Williams all provided helpful feedback on the content of this article that was much appreciated. I am of course responsible for any remaining errors in the document.