Data Analysis
Below is a showcase of some of my most recent and expansive data projects. As with many such endeavors, each one taught me far more than I expected. The full blog posts, also available on my Medium page, aim to be entertaining as well as informative. My GitHub includes all the source code, datasets, and every resulting chart and visualization. My Tableau profile can be viewed here.
Books are Datasets: Mapping 12 Sacred Texts with Python and D3.js
Summary
In this project I used data visualization and natural language processing (NLP) techniques to reveal deeper connections within and between 12 religious texts. The texts analyzed include the Bible, Quran, Midrash, Tao Te Ching, Dhammapada, Tibetan Book of the Dead, Book of Enoch, Bhagavad Gita, and others. The project involved complex data cleaning and used Python and D3.js to extract key insights from the texts.
Objective
To analyze a diverse collection of sacred texts and uncover patterns, themes, and connections using NLP techniques and data visualization tools. The goal was to understand these texts beyond their surface meanings and compare them across traditions.
Methodology
Data Collection: Sacred texts were sourced from public domain repositories. Some texts, like the Midrash, required API use for extraction.
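The post does not name the API it used; as a hedged illustration, here is a minimal sketch using Sefaria's public texts API, a common source for Midrash in English. The endpoint and text reference below are assumptions for demonstration, not the project's actual code:

```python
import requests

# Hypothetical illustration: pull an English Midrash passage from
# Sefaria's public texts API. The endpoint and reference are assumptions
# for demonstration, not the project's actual source code.
BASE_URL = "https://www.sefaria.org/api/texts/"

def fetch_passage(ref: str) -> str:
    """Fetch the English text for a given Sefaria text reference."""
    response = requests.get(BASE_URL + ref, params={"context": 0}, timeout=30)
    response.raise_for_status()
    payload = response.json()
    # The "text" field holds the English translation; it may arrive as a
    # single string or a list of verse strings.
    text = payload.get("text", "")
    if isinstance(text, list):
        text = " ".join(str(segment) for segment in text)
    return text

print(fetch_passage("Midrash_Tanchuma.Bereshit.1")[:200])
```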
Data Cleaning: Texts were cleaned of footnotes, commentaries, and formatting issues. Special attention was paid to identifying significant entities and terms specific to each text.
NLP Techniques:
Word Clouds and Frequency Distribution Charts to highlight key concepts.
Phrase Collocation Analysis to uncover common word pairings.
Entity Frequency Charts and Chord Diagrams using D3.js to visualize relationships between entities.
Cosine Similarity Analysis: Heatmaps were used to compare the overall thematic and linguistic overlap between texts.
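To make the cosine similarity step concrete, here is a minimal sketch assuming scikit-learn's TF-IDF representation and placeholder file names; the actual project compared all 12 texts and may have used a different vectorization:

```python
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder titles and file names; the actual project compared 12 texts.
titles = ["Bible", "Quran", "Tao Te Ching", "Dhammapada"]
texts = [open(f"{title}.txt", encoding="utf-8").read() for title in titles]

# Represent each text as a TF-IDF vector, ignoring common English stop words.
vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)

# Pairwise cosine similarity: 1.0 on the diagonal, degree of overlap elsewhere.
similarity = cosine_similarity(vectors)

# Render the similarity matrix as a heatmap.
fig, ax = plt.subplots(figsize=(6, 5))
image = ax.imshow(similarity, cmap="viridis")
ax.set_xticks(range(len(titles)))
ax.set_xticklabels(titles, rotation=45, ha="right")
ax.set_yticks(range(len(titles)))
ax.set_yticklabels(titles)
fig.colorbar(image, ax=ax, label="cosine similarity")
plt.tight_layout()
plt.show()
```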
Key Findings
Entity Frequency: Across the texts, some terms such as “God” held dominant positions, while others revealed more nuanced ideas. Nearly all of the texts also deify concepts usually considered mundane, elevating terms such as Grace and Eternity to the status of proper nouns.
Visual Patterns: Chord diagrams revealed holistic connections between the entities within the texts, while a cosine similarity analysis highlighted the similarities between the various books. The Abrahamic texts of the Bible, Midrash, and Quran were shown to be deeply related. In addition, Morals and Dogma, the handbook of the Freemasons, was shown to be intimately connected to every other selection.
Unique Insights: Morals and Dogma emerged as the most interconnected text, sharing entities with nearly all other texts. This strongly suggests that it incorporates numerous seemingly disparate elements from various traditions.
Conclusion
The analysis of sacred texts using data science methods not only reveals connections between religions but also offers a fresh perspective on these cornerstone cultural works. The project opens up possibilities for further analysis using similar techniques on other forms of literature as well.
For More Information
Analyzing and Visualizing Reddit with NLP
Summary
In this project, I used Python to scrape Reddit’s API and perform sentiment analysis on millions of user-generated comments, post texts, and titles across several subreddits. This section outlines my process, challenges, and key findings, demonstrating my ability to work with large datasets and apply natural language processing techniques.
Objective
The goal was to explore the use of web scraping and sentiment analysis on a vast social media dataset to reveal patterns in user sentiment and discover interesting trends across communities such as r/AskReddit, r/wallstreetbets, and r/cryptocurrency. I first tested my scraping skills on the BBC’s website, and later applied targeted scraping to assess its potential for social media sentiment analysis of public figures, namely Kamala Harris, Donald Trump, and Taylor Swift.
Methodology
Tools Used: Python, PRAW, spaCy, TextBlob, VADER, Matplotlib, Pandas
Approach: I scraped millions of Reddit posts and comments, cleaned the data, and applied sentiment analysis models to visualize user sentiment and extract key phrases (a minimal sketch of this pipeline appears after this list). I also performed an initial, cursory scrape of the BBC’s website.
Challenges: Dealing with messy text data, handling Unicode/emoji issues, and cleaning Boolean values were among the most difficult parts of the process.
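Here is a minimal sketch of the scrape-and-score pipeline described under Approach, using PRAW and VADER from the tools listed above; the credentials, subreddit, and limits are placeholders rather than the project's actual settings:

```python
import praw
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Placeholder credentials; real values come from a registered Reddit app.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="sentiment-demo by u/YOUR_USERNAME",
)

analyzer = SentimentIntensityAnalyzer()
scores = []

# Score post titles and comments from the current hot posts in r/AskReddit.
for submission in reddit.subreddit("AskReddit").hot(limit=50):
    scores.append(analyzer.polarity_scores(submission.title)["compound"])
    submission.comments.replace_more(limit=0)  # drop "load more" placeholders
    for comment in submission.comments.list():
        scores.append(analyzer.polarity_scores(comment.body)["compound"])

# VADER compound scores run from -1 (most negative) to +1 (most positive).
print(f"mean sentiment across {len(scores)} items: {sum(scores) / len(scores):+.3f}")
```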
Key Findings
Sentiment Analysis: The sentiment of most Reddit communities trends at least slightly, if not overwhelmingly, positive, despite the large amount of negative or controversial content. Targeted sentiment analysis of public personalities, products, or ideas can yield very useful results. Scraping mainstream news websites to gauge sentiment is rarely fruitful, but it does provide a surface-level view of the headlines of the day. I also happened upon a strange anomaly in the data that is suggestive of cybercrime activity on Reddit.
Visualizations: Word clouds and sentiment charts were generated to provide a visual summary of the top phrases and overall community sentiment. Some examples from the blog post are shown below, including an example of mojibake.
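The blog post does not specify how the word clouds were built; as a hedged illustration, the widely used wordcloud package can generate one from a cleaned corpus in a few lines:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Placeholder corpus; in the project this was the cleaned comment text.
corpus = "example reddit comment text another cleaned comment and so on"

# Generate a word cloud of the most frequent terms and render it.
cloud = WordCloud(width=800, height=400, background_color="white").generate(corpus)
plt.figure(figsize=(10, 5))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```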
Conclusion
This project allowed me to sharpen my skills in data scraping, natural language processing, and sentiment analysis while handling a large, messy dataset. Future work may involve exploring targeted scraping more deeply and refining these models, or adopting others, to better capture the nuances of social media language.
For More Information
How Random is the Lottery, Really?
Summary
In this project, I used Python to analyze large lottery datasets and determine just how random the lottery numbers are. By leveraging statistical methods like the Chi-square test, I was able to visualize the distribution of winning numbers and assess whether patterns exist in Powerball, Quick Draw, and other lottery games.
Objective
The goal was to test the randomness of lottery results using Chi-square analysis and explore whether any numbers are more likely to be drawn than others.
Methodology
Tools Used: Python, NumPy, SciPy, Matplotlib, Tableau, Excel
Approach: I collected historical Powerball and Quick Draw data, ran Chi-square tests to compare expected and observed distributions, and visualized the results to assess the randomness of the numbers.
Key Concepts: The Chi-square test helps determine how likely it is that any observed differences in number frequency are due to chance rather than an underlying pattern.
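As a concrete illustration of that test, here is a minimal sketch using SciPy on synthetic draw data standing in for the real Powerball history; the white-ball range of 1-69 reflects Powerball's current format, and the sampling is a simplification:

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(seed=0)

# Synthetic stand-in for the historical data: 5,600 drawings of five white
# balls numbered 1-69 (Powerball's current white-ball range). Real drawings
# sample without replacement; replacement is used here for simplicity.
draws = rng.integers(1, 70, size=(5600, 5))

# Observed frequency of each number across all drawings.
observed = np.bincount(draws.ravel(), minlength=70)[1:]

# Under the null hypothesis of a fair lottery, every number is equally likely.
expected = np.full(69, observed.sum() / 69)

statistic, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {statistic:.1f}, p = {p_value:.3f}")
# A large p-value (like the 0.22 reported below) means the observed
# deviations are consistent with pure chance.
```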
Key Findings
Some numbers appear to be drawn more often than others across different games, but rigorous statistical analysis shows that the deviations from randomness are actually minimal. For example, Powerball numbers show seemingly large deviations in distribution, with numbers such as 61 drawn almost twice as often as the least-drawn number in a dataset of approximately 5,600 drawings and roughly 400,000 data points. However, the overall p-value of 0.22, derived from the Chi-square test, suggests that this is well within the bounds of random chance. For Quick Draw, whose dataset contained approximately 1 million drawings and therefore 20 million data points, more than any other dataset, the results were uniformly distributed, demonstrating the high degree of randomness in lottery number generation.
Conclusion
While there are some apparent patterns, statistical analysis shows that lottery numbers are sufficiently random by design. Additional analysis of other lotteries has helped confirm these findings. While there may be a slim possibility of tilting a given game very slightly in one’s favor, the evidence strongly suggests that winning the lottery remains a game of pure chance.