Tag Archive for 'R'

Data Science Time Warp Machine

Fedora 38 sometimes freezes and crashes when using GNOME on bare metal, which may be the result of GNOME reliability issues.  In a previous article I detailed creating a massive repository of Fedora 38 packages, and I still have it.  I will not delete the 238GB repository, because Fedora 40 is the last release with Python 2.7 in the repositories; they elected to remove it completely in Fedora 41 and beyond.  I created some software in Python 2.7 that may never make it to Python 3, because given my available time in the present day, I will be an old man by the time the conversion could be completed. A few years ago I migrated from bare metal to WSL with Fedora 36. I created my own WSL instance using the Fedora 36 cloud-init image, upgraded it over the years to Fedora 38, and then ceased updating it.  WSL crashes, however, and cannot be relied upon to run tasks that require many hours of continuous processing.

WSL really was wonderful for development and for running Linux applications with underlying Linux features.  I used it for development with PyCharm.  The problem is that I would often return after 12 hours to a message that the terminal could be closed with CTRL + D, which indicated that the service had stopped for some reason.  I suspect these failures occurred when available RAM conflicted with Linux's /dev/shm shared-memory feature, and troubleshooting them would take too long. I do not trust the releases from the Windows Store, because forced updates in Windows can take features away or cause unexpected problems.  I upgraded my Windows 11 Home desktop to Windows 11 Pro specifically so I could disable automatic Windows updates via the group policies, service disablement, and registry modifications that fail to stop automatic updates on Windows 11 Home.

To create a long-use time capsule of sorts, I decided to switch from Fedora 38 to Alma Linux 8.  Alma Linux 9 follows the tradition of RHEL 9 and removes the easy support for Python 2.

I set up Alma Linux 8.10 (Cerulean Leopard), installed from the KDE live DVD, and installed RStudio Server for access via web browser.

Edit /etc/dnf/dnf.conf and add keepcache=True
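
After the edit, /etc/dnf/dnf.conf would contain something like the following (only the keepcache line is added; the rest of [main] is left as shipped):

[main]
keepcache=True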

dnf install epel-release    
dnf config-manager --set-enabled powertools    
dnf install R    
dnf install python2

The python2 package brings in pip for Python 2 automatically; one invokes it via the pip2.7 command.

As a regular user, the following is required for a script I wrote, because parsedatetime changed after version 2.5 and later releases are no longer compatible with the previous behavior.

pip2.7 install parsedatetime==2.5 --user

• Install rstudio-2024.12.0+467-1.rpm from direct download

• Install rstudio-server-rhel-2024.12.0-467.rpm from direct download

systemctl enable rstudio-server

Configure the firewall to allow 8787.
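
With firewalld, which Alma Linux 8 uses by default, that amounts to something like:

firewall-cmd --permanent --add-port=8787/tcp
firewall-cmd --reload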

usermod -a -G rstudio-server <username> 
setenforce 0

The last instruction turns off SELinux temporarily, until I can ascertain the specific rules that need modification to allow it to work. With SELinux enforcing under the initial configuration, the server cannot be accessed remotely via web browser.
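
For anyone attempting the same diagnosis, a common approach is to read the AVC denials from the audit log and generate a local policy module from them; the module name below is an arbitrary choice:

ausearch -m avc -ts recent
ausearch -m avc -ts recent | audit2allow -M rstudio_local
semodule -i rstudio_local.pp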

Building an anti-bitrot bunker

When I first obtained a CompTIA A+ certification some time ago, the concept of bit rot referred to what happens to software reliability as the number of updates to a system increases. Over time, software that worked in years past ceases to work as new system updates block system calls or change permissions and files that the software originally relied upon. This slow decay of reliability was called bit rot. There are some other definitions floating around on the internet, but that is the one most relevant to me.

To prevent this, and to reduce the time spent sorting out the bit rot that Microsoft's proclivities introduce into my investments, I have standardized on two operating systems for the major time investments in computing that occupy my life.  Windows still has a place, since I sometimes play Windows-only games with my child.  Beyond that necessity, I have built the things I rely on for use with Linux.  The two versions I have standardized on are Debian 12 and Fedora 38. These are not what is used for the website, but they are the major components of my anti-bitrot infrastructure.  I am aware this may not be good security practice, but this is not to get me a job; it is to serve an aging man and his family reliably over time.

There are a few reasons I selected Fedora 38 and Debian 12. Fedora 38 still has Python 2.7 in the repositories, and it was within two versions for upgrading from Fedora 36, which is what I was running in my Windows Subsystem for Linux instance before upgrading it to Fedora 38.  When deciding to move back to bare metal for my Linux software development and automation needs, I standardized on that release.  Debian has a 32-bit version, and I have both the 32-bit and 64-bit versions deployed in my network.  32-bit Debian makes it easy to run 32-bit builds of Java on a Linux server, and one can add the testing repository to get the latest Java in 32-bit form.  32-bit Java is necessary to run older Minecraft versions.  My family has a large set of mod collections and old Java Minecraft instances and maps going back about 8 years. Because the 32-bit JVM is available, I can run the latest JVM and the latest Minecraft on the same server.  It is very annoying to manage 32-bit and 64-bit Java virtual machines on the same host, so having everything 32-bit solves a huge problem.  One can also add the Debian 11 repository and install Python 2.7 if one wants to use old Python versions.  I need this old Python version for a project that I have worked on over the course of the last 8 years.

The general anti-bitrot measure for Debian is to always use apt-get to install packages, since it leaves the .deb files in the cache.  Then copy those .deb files on a regular basis to another location for use as a repository for other Debian installations; this can be automated with a cron job.  Save the .deb files on a private web server inside the network and periodically update that repository with the files swept from the cache directory.  Debian is really the only long-term viable game in town if one wants a 32-bit anti-bitrot bunker that will last into the future over, say, a ten-year horizon.
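
As a sketch of that sweep, assuming rsync over SSH with keys already configured, a nightly cron entry on each Debian machine could copy the apt cache (the .deb files live in /var/cache/apt/archives) to a repository host; the user, host, and path below are placeholders:

0 3 * * * rsync -a /var/cache/apt/archives/ user@repohost:/srv/debs/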

For Fedora, this should work with any version; Fedora 38 is the one I use.  Edit the /etc/dnf/dnf.conf file to say keepcache=True, and the downloaded RPMs will be saved.  One can then build a repository from only what one needs, if so desired.  The other genuine long-term standardization option is to mirror the entire set of repositories to the private web server within one's network.  For Fedora 38, the complete mirror was 229GB.  The process:

1. Install yumdownloader.
2. Move all of the .repo files in /etc/yum.repos.d, except for one, to a temporary location.
3. Change into a directory with a lot of space.
4. Run yumdownloader '*' to download every package from the remaining repository.
5. Run yumdownloader --source '*' to download the source packages for everything just downloaded.
6. Go back to /etc/yum.repos.d, swap that .repo file for one of those moved previously, and repeat the process until all repositories have been downloaded completely.
7. Copy all of the downloaded RPMs to one large directory on the web server and run createrepo there to create the metadata.
8. On the client machines, create a .repo file pointing to your own web server and move the existing .repo files in /etc/yum.repos.d to an archival location.

All of your installations will then occur from your own web server, and all machines will have the same versions of packages.
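
The client-side .repo file would look something like the following; the hostname and path are placeholders for one's internal web server:

[localmirror]
name=Local Fedora 38 mirror
baseurl=http://repo.lan/fedora38/
enabled=1
gpgcheck=0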

In my case, I then install the data science specifics that I need for the automated software that I created.  The process varies slightly depending on whether the system is Fedora or Debian.

For Debian:

Add bullseye to sources.list
Install python2 via the bullseye repositories
Install pip via the bootstrap script (steps per https://linuxhint.com/install-pip-on-debian-11/), that is to say:
1. wget https://bootstrap.pypa.io/pip/2.7/get-pip.py
2. python2 ./get-pip.py
apt-get install libcurl4-openssl-dev

R:
install.packages('curl')
install.packages('fpp2')
install.packages('magrittr')
install.packages('urca')
install.packages('vars')
install.packages('psych')
apt-get install r-cran-rjava
install.packages('rJava')
install.packages('xlsx')
install.packages('Hmisc')
install.packages('prophet')
install.packages('dplyr')
pip2 install parsedatetime==2.5
apt-get install awscli
pip2 install boto3
apt-get install r-cran-car

Troubleshooting steps if xlsx and other packages fail to build:
Are all R packages installed successfully with a 0 exit status?
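
One quick way to answer that question from within R, rather than scrolling back through installation logs, is to attempt to load each package; this sketch uses the package list from above and prints TRUE or FALSE for each:

pkgs <- c("curl", "fpp2", "magrittr", "urca", "vars", "psych",
          "rJava", "xlsx", "Hmisc", "prophet", "dplyr", "car")
# requireNamespace() returns FALSE instead of erroring when a package is missing
sapply(pkgs, requireNamespace, quietly = TRUE)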

On Fedora, the repositories do not contain R components in the same way the Debian repositories do.  Here is the process for Fedora 38.

Use the script https://bootstrap.pypa.io/pip/2.7/get-pip.py, as in the Debian steps above
pip2 install parsedatetime==2.5 --user
dnf install libcurl
dnf install libcurl-devel
dnf install R
dnf install awscli
pip2 install boto3 --user
dnf install cmake

Within R:
install.packages('car')
install.packages('curl')
install.packages('fpp2')
install.packages('magrittr')
install.packages('urca')
install.packages('vars')
install.packages('psych')
install.packages('rJava')
install.packages('xlsx')
install.packages('Hmisc')
install.packages('prophet')
install.packages('dplyr')

Labeling variables in R

This is a great procedure that makes it easy to remember what variables relate to in R. One of the troubles with exploratory data analysis is that when one has a lot of variables, it can become confusing what a variable was originally created for.  Code comments can certainly help, but they make the files larger and, in some cases, unwieldy.  One solution is to add comment fields to the objects themselves so that we can query an object and see a description.  So, for example, we could create a time series called sales_ts, then create a window of it called sales_ts_window_a, another called sales_ts_window_b, and so on for several unique spans of time.  As we move through the project, we may create numerous other variables and subsets of those variables.   We can see their details using head() or tail(), but that is not always a useful or clear measure.
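
For concreteness, here is a hypothetical setup matching that narrative; the numbers are invented, and the windows correspond to the manager tenures used later in this article:

# A quarterly sales series covering 2018 through 2023
sales_ts <- ts(round(runif(24, 90, 110)), start = c(2018, 1), frequency = 4)
# Three quarters under one manager, five under another
sales_ts_window_a <- window(sales_ts, start = c(2019, 1), end = c(2019, 3))
sales_ts_window_b <- window(sales_ts, start = c(2019, 4), end = c(2020, 4))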

To that end, these code segments allow applying a descriptive comment to an item and then querying that comment later via a describe command.

example_object <- "I appreciate r-cran."
# This adds a describe attribute/field to objects that can be queried.
# Could also change to some other attribute/Field other than help.
describe <- function(obj) attr(obj, "help")
# to use it, take the object and modify the "help" attribute/field.  
attr(example_object, "help") <- "This is an example comment field."
describe(example_object)

The above example uses a generic object, but it could easily be the sales_ts_window_a mentioned above.  We would use attr() to apply our description to sales_ts_window_a.

attr(sales_ts_window_a, "help") <- "Sales for the three quarters Jan was manager"
attr(sales_ts_window_b, "help") <- "Sales for the five quarters Bob was manager"

After hours or days have passed and there are many more variables under investigation, a simple query reveals the comment.

describe(sales_ts_window_a)
[1] "Sales for the three quarters Jan was manager"

This might seem burdensome, but RStudio makes it very easy to add via code snippets. We can create two snippets. The first goes at the top of the file and defines the describe function used to read the field the comment is applied to. Open RStudio Settings > Code > Code Snippets and add the following code. RStudio requires tabs to indent these.

snippet lblMaker
        #
        # Code and Example for Providing Descriptive Comments about Objects
        # 
        example_object <- "I appreciate r-cran."
        # This adds a describe attribute/field to objects that can be queried.
        # Could also change to some other attribute/Field other than help.
        describe <- function(obj) attr(obj, "help")
        # to use it, take the object and modify the "help" attribute/field.  
        attr(example_object, "help") <- "This is an example comment field."
        describe(example_object)

snippet lblThis
        attr(ObjectName, "help") <- "Replace this text with comment"

Now one can use code completion to add the label maker to the top of the script: simply start typing lblMak and hit the Tab key to complete the snippet. When labeling an object for future examination, start typing lblTh, hit Tab to complete it, replace ObjectName with the variable name, and replace the string on the right with the comment. These code snippets provide a valuable way to store descriptive information about variables as they are created and set aside for potential future use.

This functionality overlaps with the built-in comment() function, with a bit of a twist. The description added via this method appears at the end of the print output when typing the variable name, whereas a comment set with the built-in function does not print out at all. The built-in is also less intuitive than calling describe() and receiving a description.
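
For comparison, a minimal sketch of the built-in approach, continuing the same example:

# comment() stores text in the "comment" attribute, which print() never displays
comment(sales_ts_window_a) <- "Sales for the three quarters Jan was manager"
comment(sales_ts_window_a)  # must be queried explicitly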

Several R packages provide their own describe() commands, but they are often not useful for this purpose. summary() is the one I use most often, and for a good description I import the psych package and use psych::describe(data). Because of that, the describe() method in this article is very useful. The printout appears like the following, with the [1] prefix:

> example_object
[1] "I appreciate r-cran."
attr(,"help")
[1] "This is an example comment field."

Adding attributes other than "help" could easily be accomplished; describeAuthor, describeLocation, and other functions could be added. When using a console to program, a conversational style makes the work flow better.
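
A sketch of those variations follows; the function and attribute names here are arbitrary choices, not anything standard:

# Companions to describe(), each reading its own attribute
describeAuthor <- function(obj) attr(obj, "author")
describeLocation <- function(obj) attr(obj, "location")
attr(sales_ts_window_a, "author") <- "Compiled from the regional sales exports"
describeAuthor(sales_ts_window_a)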

My Favorite Function

My favorite function of all time is varsoc in Stata.  That is saying a lot, because I have been working with computers for decades and have written software in several languages, used many different types of administrative software tool sets, and owned a lot of books with code in them.  varsoc regresses one variable, y, upon another variable, x, and then regresses each lag of y on x to produce output that shows the best-fitting lag for a regression model.   It allows someone analyzing time series data to know immediately that data from several periods prior is a better predictor of today's reality than more recent data.  I adore Stata for scientific analysis.  To use this in my big data project I needed to automate it, so I wrote an R vignette that analyzes 45 lags and produces the relevant test statistics. My vignette produces R² values [1], parameter estimates, and F-statistics for 45 lags of y regressed on x. The p-values are then written to a CSV file. The decision rule for a p-value is that we reject the null hypothesis if the p-value is less than or equal to α/2 [2]. The data comes from 5GB of CSV files that were created via Python.

Running the lags shows us the relationships between the historical prices of two securities. When we regress y on x in this case, we are regressing the price of security 2 on the price of security 1, and we then do this on a lag: L1 of security 2 regressed on security 1's L0, then L2 of security 2 on security 1's L0, and so on for 45 iterations. For example, we might find that the price of a gold ETF 44 days ago has a stronger relationship with the price of Apple stock today than the price of that same gold ETF 12 days ago, or even today. That is an example only and not anything substantiated in the data. There will certainly be some spurious relationships, such as an ETF buying shares of Apple and then the same ETF's fee going up the next month. To mitigate this, the vignette uses the first difference of the logarithm so that the data is stationary, and the CSVs are already produced so that unit roots are accounted for. This is a research project to identify what actually bodes well in other sectors. It runs on every listed security on the American exchanges: every symbol is regressed on Apple, every symbol is regressed on Microsoft, and so on.
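
The vignette itself is not reproduced here, but a minimal sketch of the lag scan it performs might look like the following, where y and x are the first differences of the log prices of security 2 and security 1; the sec1 and sec2 objects in the last lines are hypothetical price vectors of equal length:

# Regress lag k of y on lag 0 of x for k = 1..45 and record the fit statistics
lag_scan <- function(y, x, max_lag = 45) {
  out <- data.frame(lag = 1:max_lag, r2 = NA_real_, beta = NA_real_,
                    f_stat = NA_real_, p_value = NA_real_)
  n <- length(y)
  for (k in 1:max_lag) {
    yk <- y[1:(n - k)]   # lag k of security 2
    x0 <- x[(k + 1):n]   # lag 0 of security 1
    fit <- summary(lm(yk ~ x0))
    out[k, -1] <- c(fit$r.squared,
                    fit$coefficients["x0", "Estimate"],
                    fit$fstatistic[1],
                    fit$coefficients["x0", "Pr(>|t|)"])
  }
  out
}
# First-difference the logs for stationarity, then write the table to CSV
results <- lag_scan(diff(log(sec2)), diff(log(sec1)))
write.csv(results, "lag_statistics.csv", row.names = FALSE)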

I initially began this project some time ago, and at that time I stopped because it was going to take a solid month of continuous 12-core processing to accomplish the entire series. In retrospect I should have let it proceed, but there would have been a great tradeoff: I could not have played Roblox, The Isle, and Ark Survival Evolved with my daughter. Finally, I have the research running on a new machine dedicated to that purpose, with an AMD Ryzen 5 3500 and an NVMe SSD. The program is running on 6 cores in parallel. Previously, with the one-month estimate, it was running concurrently on 12 cores of Westmere Xeon CPUs and storing the output in RAM instead of on an SSD. This will serve as an interesting test for the Ryzen, since all six cores will be running at 100% for months on end. The operating system is openSUSE Leap 15.2, the R version is 4.0.5, and the Python version is 2.7.18.
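
As a sketch of how the six-core arrangement might look in R, assuming one CSV of prices per symbol and reusing the hypothetical lag_scan() from the sketch above (the directory and column names are invented):

library(parallel)
files <- list.files("csv_output", pattern = "\\.csv$", full.names = TRUE)
# mclapply() forks worker processes; mc.cores = 6 keeps six cores busy
results <- mclapply(files, function(f) {
  prices <- read.csv(f)
  lag_scan(diff(log(prices$symbol_price)), diff(log(prices$apple_price)))
}, mc.cores = 6)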

One of the reasons to write these articles is for my own memory; it gets harder to remember as one gets older. These blog posts are essentially a public notebook to aid myself and others.


[1]  R² is the coefficient of determination, the square of the Pearson correlation coefficient r, the formula for which is r = β1(σx/σy), where β1 is the parameter estimate. Plain ASCII text does not allow placing a circumflex, ^, on top of the β, and since the objective for this documentation is multiplatform long-term readability, an equation editor with specialized support for circumflexes is out of the question.

[2]  There is also the rejection region method: we reject the null hypothesis if the test statistic's absolute value is greater than the critical value, which we can express as reject H0 if |t| > t(α/2, n−1).