A Blast from Python Past -- Part 2

Last week I posted the first of a three-part series on basic data programming with Python. For that article, I resurrected scripts written 10 years ago that deployed core Python data structures and functions to assemble a Python list for analyzing stock market returns. While it was fun refreshing and modernizing that code, I'm now pretty spoiled working with advanced libraries like NumPy and Pandas that make data programming tasks much simpler.

This second post revolves around a brief showcase of NumPy, a comprehensive library created in 2005 that extends the Python core to accommodate "large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays."

In addition to introducing a wealth of highly-performant new data structures and mathematical functions, NumPy changed the data programming metaphor in Python from procedural to declarative specification. In Part 1, I detailed loop-driven code to build the final lists; in this part, I mostly just invoke array functions and structured subscripting to complete the same tasks.
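
To make the contrast concrete, here is a minimal sketch of the same percent-change calculation done both ways, using a made-up list of index levels rather than the downloaded data:

import numpy as np

levels = [100.0, 102.0, 99.0, 101.5]   # made-up daily index levels

# procedural: spell out the iteration yourself
pctch_loop = [levels[i] / levels[i - 1] - 1 for i in range(1, len(levels))]

# specification: state what you want; NumPy handles the iteration
arr = np.array(levels)
pctch_np = np.diff(arr) / arr[:-1]      # same three numbers as pctch_loop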

Though I'm the first to acknowledge not being a NumPy expert, I had little trouble figuring out what to do with the help of Stack Overflow. Indeed, those familiar with the relatively recent Pandas library for data analysis will readily adapt to the foundational NumPy programming style. Core Python structures such as lists, dictionaries, comprehensions, and iterables serve primarily to feed the NumPy/Pandas beasts.

For the analysis that follows, I focus on performance of the Russell 3000 index, a competitor to the S&P 500 and Wilshire 5000 for "measuring the market". I first download two files -- a year-to-date file and a history file -- that together provide daily Russell 3000 index levels starting in 2005. Attributes include index name, date, level without dividends reinvested, and level with dividends reinvested. I then wrangle the data using NumPy to get to the desired end state.

The technology used for all three articles revolves around JupyterLab 0.32.1, Anaconda Python 3.6.5, NumPy 1.14.3, and Pandas 0.23.0.

Add a local directory to the import path.

In [1]:
import sys

functdir = "c:/data/jupyter/notebooks/functions"
sys.path.append(functdir)

print("\n\n")


Load the library and document function signatures. The prmeta, prarr, and blanks functions are used in this article.

In [2]:
import myfuncs as my

print(help(my))

my.blanks(2)
Help on module myfuncs:

NAME
    myfuncs

FUNCTIONS
    blanks lambda n
    
    frequenciesdf(df, fvar)
        (df - pandas dataframe; fvar - list of dataframe columns)
    
    metadf(df)
        (df - pandas dataframe)
    
    prarr(arr, howmany=3)
        (arr - list/array/dataframe); howmany -- # elements to print)
    
    prmeta(obj)
        (obj - python object)
    
    tally lambda df

FILE
    c:\data\jupyter\notebooks\functions\myfuncs.py


None
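
The myfuncs module is a personal library, so its source isn't reproduced here. For readers who want to run the cells that follow, here's a plausible reconstruction of the three helpers used in this article, inferred from the signatures and output above (my sketch, not the actual file):

# myfuncs.py -- hypothetical reconstruction of the helpers used in this article
blanks = lambda n: print("\n" * n)      # print n blank lines

def prmeta(obj):
    """(obj - python object) -- print class, shape, and dtype."""
    print(type(obj))
    print(obj.shape)
    print(obj.dtype, "\n")

def prarr(arr, howmany=3):
    """(arr - list/array/dataframe; howmany - # elements to print) -- show head and tail."""
    print(arr[:howmany], "\n")
    print(arr[-howmany:])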



Import relevant Python libraries.

In [3]:
import os
import pandas as pd
import numpy as np
from datetime import datetime
from operator import mul
from functools import reduce

import numpy.lib.recfunctions as nlr
import urllib3
import re


my.blanks(2)


Save the current working directory. Set and migrate to the new working directory.

In [4]:
owd = os.getcwd()

nwd = "c:/data/russell/potus"
os.chdir(nwd)
print(os.getcwd())

my.blanks(2)
c:\data\russell\potus



Establish a list of URLs for the CSV files to be downloaded. The two files are the year-to-date (ytd) and history (hist) of the daily Russell 3000 index levels.

In [5]:
urllst = [
"http://www.ftse.com/products/russell-index-values/home/getfile?id=valuesytd_US3000.csv",
"http://www.ftse.com/products/russell-index-values/home/getfile?id=valueshist_US3000.csv"
]

my.blanks(2)


Build a dictionary with attribute names and formats of the data to be downloaded/processed. Feed the NumPy beast.

In [6]:
meta = {'names' : tuple(["portfolio","date","levwodiv","levwdiv","pctchwo","pctchw"]),
        'formats' : tuple(['U15', 'datetime64[D]', 'f8','f8','f8','f8'])}

print(meta['names'][:4])
print(meta['formats'][:4])

my.blanks(2)
('portfolio', 'date', 'levwodiv', 'levwdiv')
('U15', 'datetime64[D]', 'f8', 'f8')



Instantiate the data source connections.

In [7]:
ds1 = np.DataSource(None)
ds2 = np.DataSource(None)

my.blanks(2)


Download the Russell 3000 history file using the NumPy loadtxt function. Its interface looks much like Pandas' read_csv.

In [8]:
convert = lambda x: datetime.strptime(x,'%m/%d/%Y')

ifile1 = ds1.open(urllst[1])
hist = np.loadtxt(ifile1,delimiter=",",skiprows=1,ndmin=1,encoding='utf-8',converters={1:convert},
   dtype={'names': meta['names'][:4],
          'formats': meta['formats'][:4]}
                 )   
hist['portfolio'] = "r3000"

my.blanks(2)


Note that the hist data structure is a NumPy array -- but a special structured one with a one-dimensional shape, since its fields have different data types. This array operates a bit like a Pandas dataframe.

In [9]:
my.prmeta(hist)
my.prarr(hist)

my.blanks(2)
<class 'numpy.ndarray'>
(3650,)
[('portfolio', '<U15'), ('date', '<M8[D]'), ('levwodiv', '<f8'), ('levwdiv', '<f8')] 

[('r3000', '2018-12-31', 2704.369326, 7733.10573 )
 ('r3000', '2018-12-28', 2681.191917, 7665.856863)
 ('r3000', '2018-12-27', 2682.958961, 7668.843455)] 

[('r3000', '2005-01-05', 1238.46738 , 2692.39825 )
 ('r3000', '2005-01-04', 1245.475031, 2707.401151)
 ('r3000', '2005-01-03', 1261.856142, 2742.74548 )]
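
As an aside, fields of a structured array are accessed by name, much like dataframe columns -- a couple of illustrative one-liners against hist:

print(hist['date'][:3])          # first three trading dates in the file
print(hist['levwdiv'].max())     # highest dividend-reinvested level in the history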



Now read the year-to-date file and take a look.

In [10]:
convert = lambda x: datetime.strptime(x,'%m/%d/%Y')

ifile2 = ds2.open(urllst[0])
ytd = np.loadtxt(ifile2,delimiter=",",skiprows=1,ndmin=1,encoding='utf-8',converters={1:convert},
   dtype={'names': meta['names'][:4],
          'formats': meta['formats'][:4]}
                 )   
ytd['portfolio'] = "r3000"

my.blanks(2)


In [11]:
my.prmeta(ytd)
my.prarr(ytd)

my.blanks(2)
<class 'numpy.ndarray'>
(25,)
[('portfolio', '<U15'), ('date', '<M8[D]'), ('levwodiv', '<f8'), ('levwdiv', '<f8')] 

[('r3000', '2019-02-04', 2958.569355, 8471.495048)
 ('r3000', '2019-02-01', 2937.551087, 8411.15188 )
 ('r3000', '2019-01-31', 2932.905599, 8396.8695  )] 

[('r3000', '2019-01-03', 2643.051723, 7559.582582)
 ('r3000', '2019-01-02', 2707.170883, 7741.158427)
 ('r3000', '2018-12-31', 2704.369326, 7733.10573 )]



Sort the ytd structure and delete the 2018 record that's redundant with the hist data.

In [12]:
ytd.sort(order='date')
ytd = np.delete(ytd,[0])

my.prarr(ytd)

my.blanks(2)
[('r3000', '2019-01-02', 2707.170883, 7741.158427)
 ('r3000', '2019-01-03', 2643.051723, 7559.582582)
 ('r3000', '2019-01-04', 2734.493258, 7821.127295)] 

[('r3000', '2019-01-31', 2932.905599, 8396.8695  )
 ('r3000', '2019-02-01', 2937.551087, 8411.15188 )
 ('r3000', '2019-02-04', 2958.569355, 8471.495048)]



Concatenate hist and ytd.

In [13]:
combine = np.append(ytd,hist)
combine.sort(order='date')

my.prmeta(combine)
my.prarr(combine)

my.blanks(2)
<class 'numpy.ndarray'>
(3674,)
[('portfolio', '<U15'), ('date', '<M8[D]'), ('levwodiv', '<f8'), ('levwdiv', '<f8')] 

[('r3000', '2005-01-03', 1261.856142, 2742.74548 )
 ('r3000', '2005-01-04', 1245.475031, 2707.401151)
 ('r3000', '2005-01-05', 1238.46738 , 2692.39825 )] 

[('r3000', '2019-01-31', 2932.905599, 8396.8695  )
 ('r3000', '2019-02-01', 2937.551087, 8411.15188 )
 ('r3000', '2019-02-04', 2958.569355, 8471.495048)]



Find and delete duplicate Russell index level records.

In [14]:
dups = np.unique(combine[['levwodiv','levwdiv']], return_index=True)

ncombine = combine[dups[1]]
ncombine.sort(order='date')

my.prmeta(ncombine)
my.prarr(ncombine)

my.blanks(2)
<class 'numpy.ndarray'>
(3545,)
[('portfolio', '<U15'), ('date', '<M8[D]'), ('levwodiv', '<f8'), ('levwdiv', '<f8')] 

[('r3000', '2005-01-03', 1261.856142, 2742.74548 )
 ('r3000', '2005-01-04', 1245.475031, 2707.401151)
 ('r3000', '2005-01-05', 1238.46738 , 2692.39825 )] 

[('r3000', '2019-01-31', 2932.905599, 8396.8695  )
 ('r3000', '2019-02-01', 2937.551087, 8411.15188 )
 ('r3000', '2019-02-04', 2958.569355, 8471.495048)]
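
For intuition on what return_index=True does, here's a tiny example on a plain array rather than the Russell data:

vals = np.array([3.0, 1.0, 3.0, 2.0])
uniq, first_idx = np.unique(vals, return_index=True)
# uniq      -> [1. 2. 3.]   (sorted unique values)
# first_idx -> [1 3 0]      (position of each value's first occurrence)
# vals[first_idx] keeps one copy of each value -- the same trick combine[dups[1]] uses above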



Compute percent change variables for each of the two index levels. Note that np.diff yields one fewer element than its input.

In [15]:
pctchwo = np.diff(ncombine['levwodiv'])/ncombine['levwodiv'][:-1]
pctchw = np.diff(ncombine['levwdiv'])/ncombine['levwdiv'][:-1]

my.prmeta(pctchw)
my.prarr(pctchw)

my.blanks(2)
<class 'numpy.ndarray'>
(3544,)
float64 

[-0.01288648 -0.00554144  0.00367974] 

[0.00871876 0.00170092 0.00717419]
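
To see the arithmetic behind np.diff, a quick check with made-up levels:

lev = np.array([100.0, 102.0, 99.0])
print(np.diff(lev))               # [ 2. -3.] -- day-over-day differences
print(np.diff(lev) / lev[:-1])    # [ 0.02 -0.02941176] -- divided by the prior day's level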



Insert a NaN entry at the top of each array so the lengths conform with the ncombine structure -- the first date has no prior day, hence no percent change.

In [16]:
pctchwo = np.insert(pctchwo, 0, np.nan)
pctchw = np.insert(pctchw, 0, np.nan)

my.prmeta(pctchw)
my.prarr(pctchw)

my.blanks(2)
<class 'numpy.ndarray'>
(3545,)
float64 

[        nan -0.01288648 -0.00554144] 

[0.00871876 0.00170092 0.00717419]



Bind the percent change attributes to ncombine to construct final.

In [17]:
final = nlr.merge_arrays([ncombine, pctchwo, pctchw],flatten=True)
final.dtype.names = meta['names']

my.prmeta(final)
my.prarr(final)

my.blanks(2)
<class 'numpy.ndarray'>
(3545,)
[('portfolio', '<U15'), ('date', '<M8[D]'), ('levwodiv', '<f8'), ('levwdiv', '<f8'), ('pctchwo', '<f8'), ('pctchw', '<f8')] 

[('r3000', '2005-01-03', 1261.856142, 2742.74548 ,         nan,         nan)
 ('r3000', '2005-01-04', 1245.475031, 2707.401151, -0.01298176, -0.01288648)
 ('r3000', '2005-01-05', 1238.46738 , 2692.39825 , -0.00562649, -0.00554144)] 

[('r3000', '2019-01-31', 2932.905599, 8396.8695  , 0.00852727, 0.00871876)
 ('r3000', '2019-02-01', 2937.551087, 8411.15188 , 0.00158392, 0.00170092)
 ('r3000', '2019-02-04', 2958.569355, 8471.495048, 0.00715503, 0.00717419)]



Convert final to a NumPy recarray, which behaves even more like a Pandas dataframe in that fields can be accessed as attributes.

In [18]:
final = final.view(np.recarray)

my.prmeta(final)
my.prarr(final)

my.blanks(2)
<class 'numpy.recarray'>
(3545,)
(numpy.record, [('portfolio', '<U15'), ('date', '<M8[D]'), ('levwodiv', '<f8'), ('levwdiv', '<f8'), ('pctchwo', '<f8'), ('pctchw', '<f8')]) 

[('r3000', '2005-01-03', 1261.856142, 2742.74548 ,         nan,         nan)
 ('r3000', '2005-01-04', 1245.475031, 2707.401151, -0.01298176, -0.01288648)
 ('r3000', '2005-01-05', 1238.46738 , 2692.39825 , -0.00562649, -0.00554144)] 

[('r3000', '2019-01-31', 2932.905599, 8396.8695  , 0.00852727, 0.00871876)
 ('r3000', '2019-02-01', 2937.551087, 8411.15188 , 0.00158392, 0.00170092)
 ('r3000', '2019-02-04', 2958.569355, 8471.495048, 0.00715503, 0.00717419)]



Write a CSV file from the final recarray.

In [19]:
hdr = ",".join(final.dtype.names)
np.savetxt("r3000np.csv", final, delimiter=",",fmt="%s",header=hdr)

my.blanks(2)
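
One caveat worth flagging: np.savetxt prepends its comments string ('# ' by default) to the header, so the file above starts with '# portfolio,date,...'. Passing comments='' writes a bare header row instead:

np.savetxt("r3000np.csv", final, delimiter=",", fmt="%s", header=hdr, comments="")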


Do some calculations. First up -- what would $1 invested on 12/31/2018 be worth today? Darned good!

In [20]:
begin2019 = np.datetime64(datetime(2019,1,1))
slug = final[final.date>=begin2019]
slug.sort(order='date')
print("$%.2f" % round(np.prod(slug.pctchw+1),2))

my.blanks(2)
$1.10
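
As a quick sanity check (my addition, assuming final is still sorted ascending by date), the chained daily returns should telescope to the ratio of the latest level to the 2018 year-end close:

base = final[final.date < begin2019].levwdiv[-1]    # last 2018 close
print("$%.2f" % round(slug.levwdiv[-1] / base, 2))  # should match the product above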



How about $1 invested on 12/31/2017? Not so good.

In [21]:
begin2018 = np.datetime64(datetime(2018, 1, 1))
slug = final[final.date>=begin2018]
slug.sort(order='date')
print("$%.2f" % round(np.prod(slug.pctchw+1),2))

my.blanks(2)
$1.04



And 8/31/2018? Awful. Timing's everything!

In [22]:
sep2018 = np.datetime64(datetime(2018, 9, 1))

slug = final[final.date>=sep2018]
slug.sort(order='date')
print("$%.2f" % round(np.prod(slug.pctchw+1),2))
      
my.blanks(2)
$0.94



What does the subset array look like? It's a recarray too.

In [23]:
my.prmeta(slug)
my.prarr(slug)

my.blanks(2)
<class 'numpy.recarray'>
(105,)
(numpy.record, [('portfolio', '<U15'), ('date', '<M8[D]'), ('levwodiv', '<f8'), ('levwdiv', '<f8'), ('pctchwo', '<f8'), ('pctchw', '<f8')]) 

[('r3000', '2018-09-04', 3165.64708 , 8994.575532, -0.00165439, -0.00157687)
 ('r3000', '2018-09-05', 3155.209519, 8965.081946, -0.00329713, -0.00327904)
 ('r3000', '2018-09-06', 3142.603548, 8931.618244, -0.00399529, -0.00373267)] 

[('r3000', '2019-01-31', 2932.905599, 8396.8695  , 0.00852727, 0.00871876)
 ('r3000', '2019-02-01', 2937.551087, 8411.15188 , 0.00158392, 0.00170092)
 ('r3000', '2019-02-04', 2958.569355, 8471.495048, 0.00715503, 0.00717419)]



Contrast the percent change calculations of NumPy with those of vanilla Python. First, NumPy. Note that the functions and subscripting are array-driven: the coder specifies what, not how.

In [24]:
pctchnp = np.diff(final.levwdiv)/final.levwdiv[:-1]

my.prmeta(pctchnp)
my.prarr(pctchnp)

my.blanks(2)
<class 'numpy.ndarray'>
(3544,)
float64 

[-0.01288648 -0.00554144  0.00367974] 

[0.00871876 0.00170092 0.00717419]



Next, vanilla Python. The calculations here are procedural, using a list comprehension.

In [25]:
levelvp = final.levwdiv.tolist()
pctchvp = [round((levelvp[i]/levelvp[i-1]-1),8) for i in range(1,len(levelvp))]

my.prarr(pctchvp)

my.blanks(2)
[-0.01288648, -0.00554144, 0.00367974] 

[0.00871876, 0.00170092, 0.00717419]
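
The two approaches should agree to within rounding error -- a quick check:

print(np.allclose(pctchnp, np.array(pctchvp), atol=1e-8))   # True if NumPy and vanilla results match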



Next time: Pandas. Hint: it'll be easy.