Hello friends, it is episode 104 of the R Weekly Highlights podcast. It is right in the thick of the holiday season, so I hope all of you are enjoying either your time off from the office or just enjoying time with family, but hey, we're here to give you a little R-flavored entertainment for the next few minutes. My name is Eric Nantz and I am, as always, joined by my awesome co-host, Mike Thomas. Mike, how are you doing this fine day? I'm doing great, Eric. I'm doing better than you by the looks of it, but we don't have any video for the audience, and it's a function of having young kids during flu and cold season, so just for everybody out there: I was sick a couple of weeks ago and Eric's sick this week, but we still power through because that's what we do for you. Yes, it is all for you, the listeners, and yes, it is the season of giving, but this was one gift I didn't want given in my house. You can't control all that, but hey, my voice is still working for the next few minutes and we're going to make it work here. Oh yeah, let me check my notes here real quick, Mike, who's the curator? Him again? He's still here? Yes, it is me. I was the curator this week before all this flu stuff set in, so at least I got the issue out in one piece. But of course, I can't do any of my curation duties without the tremendous help of our R Weekly team, who make it so much easier to automate at least a few of the things and grab the awesome links that you all send via pull requests and the RSS feeds, and it's always a lot of fun to put all this together. So my big thanks to our curator team and contributors like you all around the world for making issue 2022-W50 happen. So let's get right into it. When we utilize a major web service that doesn't have an upfront monetary cost, I think it's become more common, especially in today's world, that the currency in this case may not be so much fiscal dollars as our data, and the hope is that the service will be a good steward of our data and that the future directions of the platform are aligned with our principles. At the very least, having the ability to grab our data out from behind the walls of the service for our own archival and perhaps even analysis purposes can give us a little peace of mind. So if you've been following the tech sector, you likely know by now that Elon Musk has recently purchased Twitter, and my goodness, the discourse around this has been strong to say the very least. Hence it is great timing that in our first highlight, Garrick Aden-Buie, a data science educator and developer at Posit, has authored a fantastic blog post on how you can unleash your R skills onto your own Twitter data archive. Now there's a lot to this data wrangling and visualization adventure, so we're going to cover the major points here, but we won't be able to do the post justice; you definitely want to look at it. And we're going to start with what can be a perplexing issue with file data formats. The good news is the data are in JSON format. Anybody that's developed web stuff like APIs, or even done things with Shiny, will find JSON a pretty comfortable format to deal with. These files, on the other hand, have a rather strange declaration at the top for assigning objects in a browser-related namespace. That makes no sense even when I say it. But this is a case where data is not always perfect, right?
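To give a flavor of what that cleanup step can look like, here is a minimal sketch, not Garrick's actual functions: it assumes the archive files begin with the usual window.YTD-style JavaScript assignment, and the file path is a hypothetical stand-in.

```r
# Minimal sketch (assumptions noted above): read a Twitter archive .js file,
# drop the leading JavaScript assignment so only the JSON array remains,
# then parse it with jsonlite.
library(jsonlite)

read_twitter_js <- function(path) {
  txt <- paste(readLines(path, warn = FALSE), collapse = "\n")
  json <- sub("^[^\\[]*", "", txt)  # strip everything before the first "["
  fromJSON(json, simplifyDataFrame = FALSE)
}

# Hypothetical usage against an unpacked archive folder:
# tweets_raw <- read_twitter_js("twitter-archive/data/tweets.js")
```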
So Garrick writes some nice utility functions that clean all this stuff up and make it reusable, so that he can use a very nice, you know, pipe-like workflow that's been made famous by the tidyverse family of packages. And those handy processing functions are included throughout the blog post, so you can just take those and run with them for your own Twitter archive. Of course, that's just half the battle, so to speak, right? Once you're able to get the data into R, he has some additional functions to get it into a tidy, you know, tabular format. We're dealing with things like nested lists again, another, you know, minor hierarchical complication. But again, great use of the purrr package and others in the tidyverse family to make the processing, once you figure it out, quite elegant to get through. Now, after the data are assembled, it's time for another visit to the ggplot2 corner. And not only do these plots utilize a fun custom theme for the blog post that gives me kind of, I don't know, a sci-fi techie vibe, it's a really cool theme here, but these plots are interactive, folks, using custom tooltips powered by David Gohel's ggiraph package. This is one of those cases where Mike and I are not sure how to pronounce a package name. So David, if you are listening and you want to make corrections, you know how to find us; we'll tell you later in the episode. You say "gee-giraffe," I'll say "jee-jee-raff," and one of us will be right. That's right. Let's cover all the bases here, shall we? Yeah. No one makes it easy for us to pronounce these things. But anyway, the first display that Garrick makes is tweets per month, which reveals a few insights into Garrick's usage based on key events in his life, which I can definitely resonate with. And then in another plot, we see what looks to be a pretty neat log-linear relationship between his popular tweets based on retweets and likes, which was kind of fascinating. He personally is thankful that the tweets he has that get a lot of likes are the ones that kind of make him feel good on the inside, so to speak, that he's sharing something useful out there. But this blog post has so much more. There are other little nuggets here that are definitely insightful, because this data archive from Twitter gives you a lot more than just things about retweets and likes, Mike. So why don't you take us through some of the things you found in the post? Absolutely. One thing that I really loved is Garrick's use of polar area diagrams. They're a chart type that I very rarely use because I feel like it's not often that I have data that lends itself well to that type of chart. However, in the blog post, Garrick creates absolutely stunning polar area diagrams to show sort of a histogram of the time of day that he tweets, faceted by the day of the week. There's a separate plot for each day of the week, and the length of sort of the slice of pie in these polar area diagrams represents how many tweets he typically tweeted out during that particular time of day.
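For listeners who want a feel for how that kind of chart comes together, here is a rough sketch and only a sketch: it assumes a data frame called tweets with a POSIXct created_at column, and it is nowhere near the styled version in Garrick's post.

```r
# Rough sketch of a polar area chart of tweets by hour, faceted by weekday;
# `tweets` and `created_at` are assumed stand-ins, not the post's real data.
library(dplyr)
library(ggplot2)
library(lubridate)

tweet_clock <- tweets |>
  mutate(
    hour    = hour(created_at),
    weekday = wday(created_at, label = TRUE, week_start = 1)
  ) |>
  count(weekday, hour)

ggplot(tweet_clock, aes(x = hour, y = n)) +
  geom_col(width = 1) +
  coord_polar(start = 0) +            # wrap the 24-hour axis into a circle
  facet_wrap(vars(weekday), nrow = 1) +
  labs(x = NULL, y = "Tweets")
```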
Another thing that really stood out to me is his use of tooltips across all of these charts, and I think that's something the ggiraph package does very well with just a little bit of HTML formatting that you can see in the code he puts alongside the plots in this blog, which is really nice. Because if you're like me, some of these other interactive visualization packages really only allow you to create custom tooltips via JavaScript, and maybe you're not super comfortable with JavaScript yet, or as comfortable as you would like to be, and you're much more comfortable with HTML. It's nice to have that option to create these beautiful tooltips and include HTML that he sort of just glues together in these particular ggiraph plots. One place where these really nice HTML tooltips shine is in Garrick's analysis of the Twitter advertising information. This is data that I didn't even know we could get from Twitter, data around promoted tweets or ads that have been shown to us on our timeline, and Garrick puts this information together in a really nice horizontal bar plot where he shows ad interactions by advertiser. In his case, Prime Video was the advertiser that showed up on his timeline the most, with 91 promoted tweet impressions and 92 promoted tweet engagements. He has these bulleted lists in each tooltip that show exactly the different engagements, or the top five engagements, related to that particular advertiser for each bar on the chart, which includes Apple TV+, Action Network, it looks like a lot of TV advertisers, but he's also got Microsoft, PNC Bank. I thought it was really interesting that that is data we actually have access to, which I think is nice because it's sort of, for us data scientists, a way to get paid back, not financially, but paid back for the fact that we have to deal with these promoted tweets and ads in the middle of our timeline. Obviously, it's a beautiful blog post in the fact that we have all of the code snippets, and just reading the code underneath all of these charts, I learned some stuff. So I think everybody can potentially get something out of here between the data prep, the visualization code, and the HTML that he includes. A really, really nicely done blog post that I certainly encourage everybody to take a look at. It's great to start off the highlights this week with a strong data viz post. The visuals are amazing here, and I think both you and I were remarking in our little bit of a pre-show that we both want to take a bigger, stronger look at ggiraph, that's my version of the pronunciation, to supercharge our plots, because it can work well both inside Shiny and outside Shiny, like in R Markdown or other formats, perhaps Quarto as well. Lots of interesting nuggets here in both the visualization and the processing, because I know Garrick's had to deal with some really unwieldy data in the past; in this case it's Twitter, but I've had some fun conversations about some of the wrangling adventures he's had before, so it's great to learn through his example. And yeah, you'll want to read this through a couple of times because there's a lot to unpack here, but it's all very engaging. And yeah, his bit of a call to action at the beginning is: if you are concerned about the future direction of said Twitter platform, get your archive now, you never know when you'll need it.
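Here is a hedged sketch of the general ggiraph pattern being described, gluing a little HTML into the tooltip aesthetic of an interactive geom; the ads data frame and its columns are made-up stand-ins, not Garrick's data or code.

```r
# Sketch of the ggiraph + HTML tooltip pattern; the `ads` data is made up.
library(ggplot2)
library(ggiraph)
library(glue)

ads <- data.frame(
  advertiser  = c("Prime Video", "Apple TV+", "PNC Bank"),
  impressions = c(91, 55, 12)
)

# Build one small HTML snippet per row and map it to the tooltip aesthetic
ads$tip <- glue_data(
  ads,
  "<strong>{advertiser}</strong><br/>{impressions} promoted tweet impressions"
)

p <- ggplot(ads, aes(x = impressions, y = reorder(advertiser, impressions))) +
  geom_col_interactive(aes(tooltip = tip, data_id = advertiser)) +
  labs(x = "Promoted tweet impressions", y = NULL)

girafe(ggobj = p)  # interactive widget in R Markdown, Quarto, or Shiny
```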
Yes, the blog post's first sentence, I think, is "Twitter finds itself in an ... interesting ... transition period," and I couldn't have said it better myself. A very diplomatic way of saying it, so credit to Garrick for that great tone there, yes. And now we're going to transition to another story that, admittedly, on the surface may not look like a lot, but it could be a very fundamental shift, especially for some of us in the life sciences space. The journey I'm about to summarize hopefully will resonate with any of you in our audience trying to bring innovation to legacy processes via open source, and in particular R itself. As part of my external collaborations, it was about a year and a half ago that I joined a cross-pharma industry working group under the R Consortium umbrella. For those of you who aren't aware, the R Consortium is based within the Linux Foundation and helps provide financial support for projects that are enhancing the R project itself or enhancing industry use of R. And it was about a year and a half ago that a work stream for what we call clinical submissions, that is, submitting clinical trial results using R, was sanctioned. So I joined that, and we've developed multiple pilots to prove out that an R-based analysis result could be delivered to regulators like the FDA using their submission transfer guidelines and systems, which I'm about to touch on a little bit here. Now, speaking of regulators like the FDA, a huge part of this working group has been their participation, sharing real-time feedback on our ideas; without their efforts, we would not have achieved what I'm about to tell you. So the real success story here is that this was full cooperation amongst those of us in the pharma space, but also those on the other side of the proverbial wall, the regulator side, which is, I think, kind of a new trend that we're going to see in this industry going forward. Last year, the working group was successful in transferring R-based statistical outputs, commonly referred to as tables, listings, and graphs, to the FDA in a mock clinical submission package, using sanctioned data that was ready for mock use. Now moving to this year, we upped the ante, so to speak, for pilot two. Now Mike, what truly game-changing domain in the R landscape can help us interact with and share dynamic summaries in a web-based format? It's got to be Shiny, Eric. Yes, it is. And hence we set out to create what amounts to a relatively simple Shiny app that surfaces the outputs of the first pilot in a completely self-contained, reproducible bundle that the FDA reviewers could install and execute on their respective systems, typically Windows laptops. Now, when you think of a self-contained, reproducible bundle in R, I don't know about you, Mike, but when I think of that, what immediately comes to mind are packages. Naturally, I set out to bundle this application with, wait for it, golem, because once you go golem, you can never go back. I'm sorry, I couldn't resist, but that's not the full story here. Specifically, beyond just having the app as a package, as a way to perhaps set a precedent for the future, we needed a way to convert the source of the app package into literally text files so it could fit within the transfer protocol. This is where it can get really hairy, but this is, again, at least a bit of a success story in terms of collaboration.
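For context, the golem side of that is the more familiar part; a minimal sketch of the pattern follows, not the pilot's actual code, and the package name is made up.

```r
# Minimal golem sketch: scaffold a Shiny app as an R package so it can be
# built, installed, and launched like any other package. "submissionapp"
# is a hypothetical name, not the pilot's.
# install.packages("golem")
golem::create_golem("submissionapp")

# Inside the generated package, the UI lives in R/app_ui.R, the server in
# R/app_server.R, and the app is launched through an exported function:
# submissionapp::run_app()
```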
Some very innovative colleagues in pharma, in this case at Merck, have authored a package called pkglite, which takes care of taking these R scripts, whether they're in a package or just a folder or what have you, and translating them into these text files. We feed those through the transfer protocol, and then pkglite can reassemble them into real R scripts in a real package structure on the other side. Yes, this is what I would consider jumping through a major hoop, but you've got to start somewhere. This was quite a learning experience for me, but I'm happy to say that after all of our efforts throughout the year, we had a successful submission of the app to the FDA as recently as November. I am going to link to the GitHub repository where the app lives, where you'll also find the app itself, which, again, is nothing fancy, but it was successfully transferred to the FDA, hopefully setting a precedent that Shiny can be a very valuable part of future submission packages. Shiny already has a major presence for many of us in life sciences, because it gives us ways to interpret analysis results or surface novel algorithms. And this pilot was a critical first step to making a clinical submission with Shiny technically possible in the current landscape, and sowing the seeds, if you will, for even bigger innovations hopefully next year. So, a gratifying effort. I will admit it's been a lot of work to get to this point, but we all have to start somewhere, and I'm hoping that the fruits of this labor will be realized especially next year as this becomes, hopefully, more routine. Well, Eric, I know this is your baby. When you go to the GitHub link and look at the repository, your name is all over it. So just a gigantic congratulations to you. It's a great step in the right direction for open source being used in life sciences and other highly regulated industries where open source has had trouble gaining a footing. And I think this not only is a great accomplishment in the context of what exactly you were trying to accomplish with this submission, but it's also a great accomplishment and a testimony to the progress that open source has made in highly regulated industries. So this is really, really exciting for me to see. Almost all of the projects for the life sciences clients we have at Ketchbrook have been Shiny-related, so that's another really cool thing to see, that the life sciences space is really strong in its adoption and use of and belief in Shiny. So I really just hope to see this continue. I hope that in the future the, so to speak, hoops you had to jump through to get this submitted successfully are fewer than the number of hoops you had to go through for this particular submission, and that we can start to see maybe more of a linear path for providing our packages, Docker containers, whatever it may be, to regulators such as the FDA. But it's really encouraging to hear as well that it was such a collaborative effort between you and the FDA. Obviously that's what made it successful, but it sort of tells the story that there is not only buy-in from your side, there's buy-in on the other side of the table as well. And that's super important, I think, to the success of open source. So kudos, hats off to you, Eric, something to be really, really proud of, and hats off to everyone involved and everyone that really believes in this effort of open source in life sciences and highly regulated industries, because it's going to change outcomes.
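To make the pkglite piece Eric described a bit more concrete, here is a minimal sketch of the kind of round trip its documentation describes; the paths are hypothetical, and the exact arguments are worth double-checking against the package docs.

```r
# Sketch of the pkglite round trip (paths are hypothetical placeholders).
library(pkglite)

# Pack: flatten the app package's sources into a single plain text file
# that can travel through an eCTD-style transfer.
collate("path/to/submissionapp", file_default()) |>
  pack(output = "r0pkg.txt")

# Unpack: rebuild the package directory from that text file on the other
# side, optionally installing it right away.
unpack("r0pkg.txt", output = "path/to/restored", install = TRUE)
```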
It will. The power that we have in the open source ecosystem, between R, Python, Julia, whatever you want to point to, is that you have the entire open source community working on these problems, which is something that did not exist before. And that's going to change lives and it's going to change outcomes. I really do believe that. Well, those are very kind words, Mike, and this was certainly a team effort across the board with my colleagues at the various other pharmas like Roche and Merck and everyone else. And then, yeah, the FDA had a big seat at this table. And yes, next year we have additional pilots in store, and one of them I'm very interested in, I mentioned it a little bit earlier: wouldn't it be nice if we could just send these apps as a Docker container, or some easy way to run them with one command? That's what we're going to try and do next year. I can't make promises yet, but we already have a little bit of the seeds planted to explore this next year. So stay tuned in this space; it could get even more exciting, technically speaking, too. Now, I don't know about you, but after summarizing that saga, I need a little breather, so to speak. So we're going to have some fun with this last highlight. And before we get there, yes, the streak is over: Albert did not make the highlights this week, but you know what? I'm sure we'll see him again soon. But in any event, Mike, you're going to take me back to an old game that I played in the playground when I was a kid. How does rock paper scissors fit into all this? Well, I mean, the fact that Albert didn't make the highlights this week might have something to do with the curator. I don't know, Albert, we'll have to talk about that. I'm just kidding, he will be back soon, I have no doubt. But on to our last highlight, on scoring rock paper scissors. So I, shamefully, very shamefully, have not participated in Advent of Code yet, but I do follow along and I love seeing everyone's solutions to these puzzles. And if you don't know about Advent of Code, it's a daily coding puzzle, kind of like Wordle for data scientists and software engineers, I would say. And again, Wordle, shamefully, is another one that I don't do, so maybe that's why I don't do Advent of Code either. But Advent of Code happens during the month of December, and the other day the Advent of Code puzzle was to build some sort of function that scores, or outputs, a win, lose, or draw result given two inputs that represent the two players' moves in a game of rock paper scissors. So the inputs would be, say, Eric threw rock and Mike threw scissors, and they're asking you to provide some sort of function that tells you who wins, either Eric or Mike, given those two inputs. Obviously, there are only a few different combinations of what could happen during a game of rock paper scissors, given that you have two participants and each one has three different options of what they could throw. So I think it's nine different combinations total that you need to provide the outcomes for, if I'm doing the math right. Advent of Code gives you the input, and they tell you what the output should be, and your job is to build, again, this thing in the middle that turns the input into the output. Now, I'm not sure if this is part of the Advent of Code unwritten rules or not, but I feel like most of the solutions I've ever seen to Advent of Code always use base R; they don't import any packages. Maybe that's just what I've seen in the past through a very small sample size.
But this is the case in Tristan (TJ) Mahr's blog post that showcases his solution to this rock paper scissors Advent of Code puzzle from day two, I think December 2nd. He provides a really elegant solution involving nested lists. So the first level of the list is what player one would throw, say rock, and then the second element of the list would be the possible responses from the other participant, from the other player. So it could be either rock, which would result in a draw, scissors, which would result in a win, right, because rock beats scissors, and then paper, which would result in a loss, right? So you do that for each of the combinations, and you have sort of this nested list that has three tiers at the top, and then underneath each of those individual tiers are three tiers of a second list. So my gut sort of really expected to see the purrr package used, but that wasn't the case, right? This is all base R that TJ used. And it led me to seeing a few base R functions which I actually hadn't been familiar with. There's a function called getElement(), which seems somewhat equivalent to the pluck() function from the purrr package, for extracting an element from a list instead of having to build up those bracketed [[1]][[1]] chains to extract the first element of the first list. So a really interesting, kind of lighthearted post, and a really nicely, concisely done answer by TJ to this day-two Advent of Code puzzle. And I don't know, it's just fun for me to do a little brain teaser with R, maybe get away from the daily grind a little bit and exercise some of those other muscles that I think will help you be a better R coder in the long run. So I would love to, and I'm not making any promises here on the podcast that I can hold myself to, but maybe I'll try to do at least one Advent of Code puzzle. I do have some time freeing up, and I'll let us know on the next podcast episode how that went. All right, way to hold yourself accountable, buddy. One thing that I noticed, and again, it was a really interesting read on the power of lists, because lists are kind of like my MVP of objects in the R language. But the elegant thing here is that, even now to some degree, I find myself falling into an if-else trap where I do a lot of if-else-if statements, and that can get pretty unwieldy. So taking advantage of how you can frame the problem differently, like what TJ does here, I think is just a great kind of meta principle: sometimes if you take a step back before you reach for your, quote unquote, tried and true solution, you might arrive at insights that you didn't think were going to be possible. Now, TJ is already a master of lists. He even references some of the ways he uses them in his R Markdown compilations with knitr; he's a power user of that. So I definitely gleaned some insights into how I can make lists even more of an MVP in my daily work too. So lots of interesting ideas, and yeah, I can't say that I've had a lot of time for Advent of Code, but maybe I will pick up on that, and there are definitely some really influential members of the R community for whom, you know, it's almost like their Super Bowl; they wait for this time of year and they crunch away.
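Before moving on, here is a base R sketch in the spirit of that nested-list approach; it is not TJ's exact code, just an illustration of looking outcomes up rather than writing if-else chains.

```r
# Base R sketch: encode every rock paper scissors outcome in a nested list,
# keyed first by player one's throw and then by player two's, and look the
# result up with getElement() instead of branching through if/else.
outcomes <- list(
  rock     = list(rock = "draw",    paper = "p2 wins", scissors = "p1 wins"),
  paper    = list(rock = "p1 wins", paper = "draw",    scissors = "p2 wins"),
  scissors = list(rock = "p2 wins", paper = "p1 wins", scissors = "draw")
)

score_round <- function(p1, p2) {
  getElement(getElement(outcomes, p1), p2)
}

score_round("rock", "scissors")
#> [1] "p1 wins"
```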
You often see David Robinson get to the top of the leaderboard for submitting his solutions quite quickly on the R side of things, and yeah, it's interesting to see the trend where, most of the time, like you said, Mike, these are base R solutions. Maybe that's just because they want to make it as easy as possible for others in the community to run them on their own setups, which again is an admirable thing. But I think also using base R kind of forces you to really up your game, so to speak, on programming logic too, because there are certain things that other packages take away, in a good way I should say, some of the pain points of coding these up in base R. So knowing what's on the inside of those solutions is another interesting way that Advent of Code can challenge us. So yeah, I had a lot of fun reading TJ's post; really nicely done. Absolutely. Yeah, I think it's something that I've said once or twice before, but, you know, we love the package ecosystem in R especially, that's what makes it so strong. But I think spending some time with what we have available in base R might surprise you, some of the utilities that are available in there that you may take for granted because we use, you know, the tidyverse every day, which leverages a lot of these base R packages, base R utilities, I should say. So I think it's good to have some perspective on both sides of the aisle in terms of the package ecosystem and what comes out of the box with the R install. Yep. And I'll tell you what's not surprising: it's another fantastic issue of R Weekly, of course, and we've got lots of good stuff in here. And I'm not just saying that because I was the curator this week; we always have good stuff here. So we'll take a couple of minutes here to share our additional finds. And this is a bit of a plug, going back to that life sciences trend. I'm happy to say that after all the editing and crunching and saving of video files that I did a few weeks ago, the R/Pharma 2022 conference recordings are now available on YouTube. So you can catch any of the presentations you missed, and also the highly regarded workshops. To me, you're going to learn so much by watching those workshops, and I'm not just saying that because I did one, too. This is really innovative stuff, everything from Quarto to using Observable, putting automated testing in your Shiny apps, building production apps, lots of interesting things here to watch. So definitely check out the playlist that we'll have in the supplements. But in particular in this issue, we link to Robert Gentleman's keynote; that name should sound familiar, he's a co-founder of R itself. It was a huge achievement for us to have him speak at the conference. So definitely watch his keynote on some of the technical challenges that he's been facing in his research group at Harvard in utilizing R on high-dimensional, complicated genetic and bioinformatic data, and some of the calls to action he has for how R can really be taken to whole new levels. There's even a little bit of the history of R, which I never get tired of hearing; it's always great to hear it from the source. So that was an amazing talk, and again, every talk is amazing, so definitely check those videos out. Yes, absolutely. And I will say that I was on hand for the Shiny in Production workshop on building production-grade Shiny apps, done by Eric, and it is absolutely worth rewatching if you were not there as well.
It was phenomenal. But one other video I will call out here is Jacqueline Nolis's lightning talk from NormConf 2022; the actual live talks start tomorrow, so very, very exciting, check out NormConf if you haven't already, but the lightning talks were pre-recorded and are up on YouTube now. And Jacqueline does a hilarious, fantastic, relatable video that details her analysis of, I believe, sunset and sunrise data in Alaska specifically, trying to build a really nice ggplot but seeing all sorts of crazy, wonky things across time zones. And she uses props in the video; it's fantastically done, and it's only a few minutes long. So if you're looking for something lighthearted and relatable from a data wrangling perspective, I would definitely check out that particular video. A couple of others, maybe, that I'll just highlight really fast: there's a blog post that says please avoid detectCores() in your R packages. The recommendation is to move away from the detectCores() function in the parallel package to the availableCores() function in the parallelly package. So if you're doing a lot of parallel computing across multiple cores, I would recommend checking that one out, just in case you did not know that blog post existed. And then I'll throw one more out there: the "base serve" package, which does Bayesian survival regression, got an update, I believe, and that is something I use quite a bit in my modeling, so it sort of spoke to me. Very good. Yeah, I have a lot of Bayesian statisticians in my group at the day job, and I know they love those kinds of packages. But yes, Jacqueline's talk, I was rolling after watching that the first time. I was watching it while my kid was trying to sleep, and I had to leave the room because I was laughing out loud and I was like, oh no, I'm going to wake him up. It was hilarious. So Jacqueline, in her patented style, absolutely nails that talk. And as someone who dealt with messy datetime stuff a year and a half ago with my Twitch stream calendar, I can definitely understand some of the pain points she had to deal with there. So kudos to her for finally figuring it out, because that would have taken me probably a year to figure all that stuff out. Yes. Datetimes are not fun. Nope, they're not. And time zone conversions are even worse, but you didn't hear that from me. What you are going to hear from me is that we always welcome your feedback on the show. You can get in touch with us and with the R Weekly project in multiple ways. Of course, go to rweekly.org, where you're going to find the current issue, the whole archive of previous issues, and links to all the podcast episodes. And if you want to contribute a pull request, we have a GitHub repo already linked, with the upcoming issue draft ready for a little markdown magic coming from you. If you want to share a great blog post, new package, video, anything that is supercharging the R community, we'd love to see it. And if you want to get in touch with us directly, well, R Weekly is on Mastodon now; we are @rweekly@fosstodon.org if you want to get in touch with us there, and your friendly host is here on Mastodon and Twitter as well, although maybe I'll be downloading my Twitter archive, just in case.
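Going back to that detectCores() recommendation for a second, the gist, as I understand it, looks something like the sketch below; the worker-count logic is a generic illustration, not code from the blog post.

```r
# parallel::detectCores() reports all hardware cores (and can even return NA),
# which risks oversubscribing shared machines and CI runners.
# parallelly::availableCores() respects cgroups, scheduler environment
# variables, and R options, and always returns at least 1.
library(parallelly)

n_workers <- availableCores()
# instead of:
# n_workers <- parallel::detectCores()

cl <- parallel::makeCluster(n_workers)
parallel::stopCluster(cl)
```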
But I'm still on there with @theRcast, and I'm also on Mastodon with @rpodcast@podcastindex.social. Mike, where can they find you online? I am still hanging around Twitter at @mike_ketchbrook, K-E-T-C-H-B-R-O-O-K, and I'm also on Mastodon at @mike_thomas@fosstodon.org. You got it. Did I get it right this time? You got it. Yes. It'll come naturally for both of us eventually. But yes, please get in touch with us, we love hearing from you, and all feedback is welcome. And if you want to have a little fun with your podcast-playing device, you can get yourself a new podcast app at newpodcastapps.com and give us a little boost to show your love for the show; more details on that in the supplements of the show notes. Well, with that, my voice has lasted this long, so I'd better quit while I'm ahead, so to speak. We're going to close up shop on this R Weekly Highlights episode 104, and we'll be back with another episode either next week or soon. Stay tuned.