DataFrame And Rolling Window In Java

Background
I need some data structure which models sheets in excel, which can hold data like excel does, and perform calculations like excel as well. I’ve tried to run the strategy from excel, but running regression and simulation would easily kill the spreadsheet, because, think about it, there are 300 stocks for calculating indicators, then each day there’s ranking, allocation, and rebalancing.

The calculation logic is straight forward, as I can easily do the prototypes in excel. However, running regression for a period of 5-10years would require help from a proper programming language. Python is one way, but after trying uqer and ricequant (the Chinese version of quantopian and yes they have market data for Chinese stocks) for couple of weeks, I gave up in the end, because the python session would die after running a few iterations of simulation.

So I decided to write a Java version of Pandas, which provides very narrowed features which would suit for my own purpose.

Pandas & Data Structures
Python has popular structures designed to work with it. This entry is trying to borrow the ideas from the two popular data structure (Series & DataFrame) and give myself some ideas when implementing them in Java.

Series
one dimensional array, with labels (index, or keys). holding types can be value, string and objects. The backing structure in java will be a Hashtable, e.g. LinkedHashMap.

Constructor:
1. Array, the keys will be array’s index.
2. Dictionary, the keys will be dictionary keys. (or two arrays with same length, one as key, the other as values.)
3. Scalar value, the keys will be sequential.

Accessor:
1. retrieve a single element via integer based index: series.get(0)
2. retrieve a sub-series via series.get(4,6)
3. retrieve head/tail: series.head(3), series.tail(3)
4. retrieve via label. series.get(key)
5. retrieve a sub-series via series.get(keys)

DataFrame (with reference from here):
Two dimensional data structure, like a sheet in excel, where it has rows and columns. The backing structure in java will be Table (from Guava).
1. Data Type: Columns might be in different types, for example, first column are dates, second columns are doubles.
2. Labels (index, keys) for rows and columns
3. Arithmetic operations on rows and columns

Constructor:
1. rowKeys (by default it’s sequence), colKeys(by default it’s sequence), Data
2. Data type: Lists, Dictionary, Series, Arrays, another DataFrame
2.1. Lists: DataFrame(List list), DataFrame(List lists), eg. DataFrame([1,2,3,4,5]), which will give rowKey, colKey automatically
2.2. Arrays: works similar to lists.
data=[[‘jack’,30],[‘mary’,25],[‘clare’,20]]
DataFrame(data, columns=[‘name’,’age’])
Note that Lists/Arrays should all be in same size.
2.3. Series:
series1 = {‘name’:[‘jack’,’mary’,’clare’]}
series2 = {‘age’:[30,25,20]}
DataFrame([series1,series2], rowKey=[‘row1’, ‘row2’, ‘row3′])
2.4. Dictionary of series (usually, a dictionary of columns)
…
series1={”:[a:’jack’, b:’mary’, c:’clare’]]}
series2={”:[a:30, b:25, c:20]}
dict={name:series1, age:series2}
DataFrame(dict)

Accessors/Modifier:
1. column seletion: df.column(‘age’)
2. column addition: df.addColumn(Dictionary aSeries), where the added series should have same keys of DataFrame’s rowKeys.
3. column deletion: df.delColumn(columnKey)
4. row selection by RowKey: df.row(‘row1’)
5. row selection by index based integer: df.row(0), df.head(5), df.tail(5), df.row(3,6)
6. row addition: pd.append(Dictionary aRowSeries), where the added series should have same keys for DataFrame’s colKeys.
7. row deletion: df.delRow(0), df.delRow(‘row1’)
8. Dimension: df.dimension(), Direction as X|Y
9. size: total elements
10. values: list of elements from dataframe
11. transpose
12. count/sum/mean/std/min/max

Iterations:
1. item-wise iterations.
2. row wise iterations.

Window Functions (with reference from here):
1. df.Rolling(window=3).mean()
With window’s feature, I can implement most of the indicators that I’m active using.

Reference:
1. A tutorial on Pandas Series:
https://www.tutorialspoint.com/python_pandas/python_pandas_series.htm

2. Pandas Introduction To Data Structure
https://pandas.pydata.org/pandas-docs/stable/dsintro.html#from-dict-of-series-or-dicts

https://www.programcreek.com/java-api-examples/index.php?source_dir=joinery-master/src/test/java/joinery/DataFrameTimeseriesTest.java

https://cardillo.github.io/joinery/v1.8/api/reference/joinery/DataFrame.html

https://stackoverflow.com/questions/20540831/java-object-analogue-to-r-data-frame

this guy seems update the project quite actively?
https://github.com/netzwerg/paleo

tutorial with 11 bullet points:

https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python

implement some operations for dataframe.

https://dzone.com/articles/pandas-tutorial-dataframe-basics

play on ricequant and revisit dataframe a bit before design in java.

https://www.ricequant.com/research/user/user_337904/notebooks/%E5%8D%81%E5%88%86%E9%92%9F%E6%90%9E%E5%AE%9Apandas.ipynb