Hello friends, we are here with episode 103 of the R Weekly Highlights podcast. My name is Eric Nantz, and as always I am delighted for you to join us from wherever you are listening around the world for some more great R content for your listening pleasure. And it's always a pleasure to be joined by my awesome co-host, who's rocking a bit of new hardware here, Mike Thomas. Mike, how are you doing today? Doing very well, Eric. I do have a new podcasting mic set up, so I have gotten pretty serious over here, and maybe listeners will notice the difference. Maybe they won't, but I'm excited. Yep. You know it gets serious when you get the dedicated mic. Yes, that is very nicely done, and for those of you who are listening to audio, you can't see it, but Mike's microphone is really solid, so I'm really impressed. Mike's mic. Exactly. I should make a meme out of that, perhaps. Yes. Yeah, so we're here to discuss the issue for this episode 103, which has been curated by Jonathan Carroll, another longtime member of our R Weekly curation team. He's been a huge help with getting a lot of the infrastructure stuff back up and running and getting access to various things, and he and I have some ideas on how we can implement some more automation with some of our new social media exploits with Mastodon. But in any event, he did a great job in this issue, and as always, he had tremendous help from our fellow R Weekly team members and contributors like you all around the world. And we begin our episode today with big news on arguably the foundational pillar of the tidyverse. Specifically, the tidyverse team at Posit is preparing a grand release of dplyr 1.1.0 for January, and Posit software engineer Davis Vaughan has authored a new blog post to put the spotlight on some major new features and updates in response to community feedback. And I'll start this off with an improvement I'm particularly excited about, and that's a new way to perform flexible joins of datasets using the new join_by() function. Now what does this really mean? What does this allow you to do? Well, it's not just the typical case of joins where you have a variable in common and match it in a quote unquote equality fashion. Now, with this new join_by(), you have the ability to introduce custom expressions for different types of joins, such as those dealing with inequality, maybe a rolling join or an overlap join, which are nicely defined in the blog post. And Davis also does a terrific job of including a realistic example of assigning employees to a company party and how you can use a hybrid of these new join capabilities to make it happen, which would not have been possible without a lot of custom manipulation in previous dplyr functions. And the write-up in the blog post is really comprehensive, going through each step of the join process and incrementally improving on it to get to that final stage. And this example particularly hit home for me, because a few years ago I was tasked with creating an app to let someone planning a company event at the day job assign attendees to tables at this company event, ensuring that at least one higher-level manager or executive was present at each table and that the attendees at a given table represented multiple functions or teams. Now of course I used Shiny for that, because I use Shiny for all the things, right? But the onus was on the app user to manually create those assignments, either in the app or within a spreadsheet that I could upload.
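To make that concrete, here's a minimal sketch of what an inequality join looks like with the new join_by() helper. This is a generic, made-up example rather than the blog post's party-planning code, and it assumes the development version of dplyr that will become 1.1.0.

```r
library(dplyr)

orders <- tibble(
  order_id = 1:3,
  placed   = as.Date(c("2022-01-05", "2022-03-10", "2022-07-01"))
)
promos <- tibble(
  starts = as.Date(c("2022-01-01", "2022-06-01")),
  pct    = c(10, 20)
)

# An equality join would be join_by(placed == starts); an inequality join can
# instead match each order to every promo window that had already started by
# the time the order was placed (one order can match several promos).
orders |>
  left_join(promos, by = join_by(placed >= starts))
```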
Well, think of it like this: if I could go back in time, or if I have to revise this app in the future, I could use the new version of dplyr to give them a little button that says, hey, you do the assignment for me and I'll proofread it later, with a combination of all these flexible joins perhaps. So I'm really excited to try that out if I get that opportunity again. And I even have another opportunity perhaps to use this, where I've seen some resources in what's called the OHDSI project (pronounced "Odyssey") for harnessing real-world health data. And they have some really complicated SQL joins, a lot of inequality joins, a lot of custom expressions inside. So now with dplyr 1.1.0, I might be able to translate some of those joins into dplyr syntax. So I'm thinking of giving that a shot as well. Now, it is important to note that the highly regarded data.table package, which is immensely popular among many in the R community, has supported these quote non-equi joins for many years, and that was part of the inspiration for the tidyverse team to implement the new join_by() function. It has been one of the most highly requested features in dplyr, going all the way back to 2016. So it is really cool to see this new feature. Those join improvements are potentially huge, and I think I know a few places in my own workflow where I'm going to try to implement those as soon as this package hits CRAN. Another huge update coming in dplyr 1.1.0 -- stop me, Eric, have you ever done a group_by, then some other dplyr function, and then an ungroup? How many times in your life do you think you've done that? Too many to count, buddy, too many to count, right? So what's coming in 1.1.0 is temporary grouping, with an additional argument in verbs that work by group, such as mutate, summarize, filter, and slice. These verbs gain a new experimental argument, .by, which allows for inline, temporary grouping. So it's pretty powerful. It's going to save you at least one line of code. So if in your previous scripts you had written something like mtcars piped to group_by(cyl), then summarize(mpg = sum(mpg)), and then you had to do an ungroup() after that, all you'll have to do now is mtcars, piped into summarize(mpg = sum(mpg), .by = cyl). So it can just be two lines of code. The output is not grouped, so again, that .by just creates a temporary grouping to perform that summarize, mutate, filter, or slice calculation and returns you an output that is not grouped. So you do not have to worry about tacking on your ungroup verb at the end of your pipeline. And I mean, most of the time, probably for 90% of the use cases I have, that's exactly what I'm doing: I'm having to ungroup after I'm doing that group_by operation. So this is going to save me a lot of code, and save me a lot of time as well. They're calling it an experimental argument. I hope that it sticks around and we can expect to see it in 1.1.0, but I think it is going to be a huge game changer in the tidyverse for all of us in all of our ETL and data manipulation pipelines. Yeah, I definitely see massive time savings with this operation. I am very thankful that Davis was upfront about the ordering piece of it.
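Roughly, that before-and-after looks like this in code, again assuming the dplyr 1.1.0 development version with the experimental .by argument.

```r
library(dplyr)

# Before: group, summarize, and remember to ungroup at the end
mtcars |>
  group_by(cyl) |>
  summarize(mpg = sum(mpg)) |>
  ungroup()

# After (dplyr 1.1.0): temporary, inline grouping; the result comes back ungrouped
mtcars |>
  summarize(mpg = sum(mpg), .by = cyl)
```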
Now, there are some in the community who are a bit concerned about this, in that some may have had pipelines in the past where they took advantage of the ordering that happened in group_by in other post-summarization or post-mutate operations. What I'll have linked in the show notes is a toot from the Posit Mastodon account where there was some interesting discussion from some faces actually familiar to R Weekly on some of the caveats that might need to be thought about. But I think as long as people are aware that .by is not going to change the ordering that was there by default when the dataset was imported, then as long as you know what to expect, I think it's definitely manageable. But again, credit to Posit for putting this out there now instead of waiting until the CRAN release of 1.1.0 and then surprising people. I think that's a very important thing in software development, especially in open-source software development, for a package like dplyr that is used so widely across many different data science workflows, to be upfront about this and not surprise people. So again, credit to Davis and the tidyverse team that put this together. But it is a very exciting release nonetheless. Yes. And maybe just one or two more things I will note about this blog post, specifically starting with the .by argument. Just like group_by, you can group by multiple columns. You don't necessarily have to provide just one column to this .by argument. You can use multiple columns, which is great. And group_by won't ever disappear. That verb will never disappear. So you don't necessarily have to worry about this impacting any of your production work, and if you don't want to switch right now, you do not have to switch right now. Two other updates coming in 1.1.0: the arrange function is getting some improvements with respect to character vectors, and there is a new function, I believe called reframe, which is a generalization of summarize. So check out the blog post for more info on those other two improvements. And certainly there are more features than what is summarized in the post, so there are definitely links in the post to additional features from the GitHub repo. And certainly if you have concerns about some of the new changes, that's what issue boards are for. I've already seen a few issues posted after the release of this blog post to clarify a few things. So if you do have concerns, and you see maybe a gap in testing the dev release, hey, that's what feedback's for, right? So I highly encourage people to check it out, especially if you're writing a package, an app, or whatever important pipeline you have that could make use of these new features. So really nicely summarized by Davis, and certainly, like I said, I could relate to that example and the SQL joins, because I was thinking, oh, you came up with that? Where was this a few years ago? Could have made my life easier. But yeah, really exciting to see here. Well, it is that time of year, Mike. We're in the holiday season basically, now that it's the end of the year, and you're probably being inundated like I am with various countdowns or top 10 lists or what have you. Apparently I'm not one of them, but if you use Spotify, based on all of your music streaming you probably received your own personalized list of your most listened-to songs this year. And yes, many, many people are tweeting that out on the various social media platforms.
In fact, I have a link to Travis Gertz's humorous LinkedIn post about how apparently a little bit of arranging and summarization is, like, the new data science hotness in these summaries, I guess. No, I'm kidding. I know this can be a lot of fun. Hey, we're the R Weekly Highlights podcast, right? How can we put a little R magic on this? Well, the very talented Nicola Rennie, a data scientist at Jumping Rivers, enters the highlights podcast once again with how she pivoted from her listening habits to deriving a distinctly R stats flavored Wrapped of her most used functions in the year. Now, this was a very fun exercise in the blog post, on both code and introspection, and of course a little bit of data munging and visualization at the end. So Nicola starts off importing all of her file paths for the R scripts related to her Tidy Tuesday submissions, a great way to have kind of a calendar-like, chronological order of how she's been using R this year. And she also utilized Nicholas Cooper's NCmisc package, easy for me to say, with a handy function called list.functions.in.file(). That's a mouthful, isn't it? But it does what it says, right? It's going to take a set of file paths, look at them, and literally give you a list of the functions and packages that were called in that script. So Nicola combined that with some purrr iteration to assemble a tidy tibble of the function frequencies, or you might say the number of times each function was used in her scripts. Now of course, this wouldn't be complete without a top-notch visualization, right? Well of course, ggplot2 enters the game here, and Nicola proceeds to assemble an infographic of the top five functions that were called in these scripts. Now this is quite meta in and of itself, because three of the top five functions are indeed, wait for it, from ggplot2, with aes() being used 47 times across her Tidy Tuesday scripts. So there you go. Usually Tidy Tuesday has some kind of visualization, right? So that's not very surprising, but hey, now you've got quantifiable evidence, in her case, that ggplot2 is an MVP of her Tidy Tuesday adventures. This is a really entertaining read and, again, very easy to follow too. So there are ample opportunities wherever you want to do this, say for your Tidy Tuesday submissions, or something I'm thinking about: say I have a directory of all my Shiny app code, what are the most common input widgets I use, or what are the most common reactive constructs I use? I could see lots of fun doing that. I'm a huge fan of Spotify Wrapped. It's like one of the easiest data products ever made. It's literally just a count and an order by. I think I saw some people last year tweeting that Spotify Wrapped is the coolest AI they've ever seen, which is just hilarious because it's probably like two lines of SQL. Yeah, ChatGPT it ain't, but hey, you know what? You've got to start somewhere. I put out a tweet today that I thought was great, hasn't gotten a whole lot of love, but for the 90s babies out there like me, we know that the original ChatGPT was SmarterChild, if you were ever on AIM. So I'm just going to leave that out there. ChatGPT isn't exciting me that much. I've seen this before. Anywho, people absolutely love Spotify Wrapped, so it was really cool to see Nicola implement this with an R spin. And that pipeline of functions that she uses, that you mentioned, Eric, to find the most used functions across all your R scripts in a directory -- that's really useful.
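For reference, here's a minimal sketch of that kind of pipeline. It is not Nicola's exact code: the directory path is just a placeholder, and since list.functions.in.file() reports the unique functions used per script, this version ends up counting how many scripts each function appears in.

```r
library(NCmisc)  # provides list.functions.in.file()
library(purrr)
library(dplyr)

# Gather every R script under a (placeholder) directory of Tidy Tuesday code
scripts <- list.files("tidytuesday", pattern = "\\.R$",
                      recursive = TRUE, full.names = TRUE)

# list.functions.in.file() returns, per script, the functions used grouped by package
all_funs <- scripts |>
  map(list.functions.in.file) |>
  map(unlist, use.names = FALSE) |>
  unlist()

# Tally how often each function shows up across the scripts
function_counts <- tibble(fun = all_funs) |>
  count(fun, sort = TRUE)

head(function_counts, 5)  # the top five that would feed the infographic
```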
I can actually see myself using that, maybe to try to write an internal package and understand what are some of these functions that I'm just using all of the time. So I don't know, I feel like there are a couple of different interesting ways that I might be able to leverage that particular logic that she's put together, which is really nice. I'm looking forward to creating my own Spotify Wrapped using Nicola's code this afternoon. I would love to see how my most used functions have changed over time. Ooh, I love that idea. That might be scary to look at. And probably, you know, if I looked a year from now, it'd be a lot of movement away from group_bys and ungroups into the .by argument. So we'll see. One of my favorite parts of the viz that she puts together is the color change at the top that makes it look like someone took a bite out of the visual. And she does it with some cool sort of random number generation with a particular seed, as well as the cumulative sum function -- just really brilliant code to create what looks like someone taking a bite out of the corner of the visual. So I highly recommend checking it out. It's not a ton of ggplot code -- a surprisingly small amount of ggplot code to create this beautiful visual. So I am absolutely excited to test it out myself, and I would encourage everybody else to test it out themselves and tweet their results. Yeah. And the key part is that there were no other custom programs to help with that visual, right? That was all in ggplot2, all in R itself. So yes, another notch in the ggplot2 belt, if you will, for creating infographics that you would never guess were produced by R. Another fantastic visual. And yes, even these top lists are also fair game here. Really, really great read. And yeah, if I turned this loose on the set of R scripts I made for my dissertation compared to now, I don't even want to know. Oh, gosh, no. I've not looked at that code for many, many years, but it's on my hard drive here in the basement somewhere. I don't have the guts to look at that, but it would be a fun exercise nonetheless. Keep it tucked far, far away. Yes, yes. No one's going to hack that little mess over there, so thank goodness. Well, speaking of a little hacking, if you will, our last highlight talks about one of Mike's and my favorite topics, and that's Shiny development, of course, and how you might be able to do a slight bit of hacking and yet make a huge improvement to your app quality. And what we're talking about here is that, out of the tin, Shiny and its related package ecosystem come with so many features out of the box. Of course, we have the huge selection of input widgets. We have reactivity. We have these great wrapper packages to give you new UIs, new ways of interaction. Case in point, one of those being bs4Dash, one of Mike's and my favorite packages for creating dashboards that look so professional, so polished. So of course, shout out to David Granjon for making bs4Dash and the RinteRface suite. But what if you're using that and you're doing what we might call a client-side interaction, but you're still losing a bit of what happened in that interaction on the server side? That's where you might be able to plug that hole with just a little bit of custom JavaScript.
And that's what, for the third episode in a row, returning to the R Weekly Highlights, Albert Rapp has another awesome blog post on: how he enhanced one of his Shiny apps that was serving a dashboard with JavaScript, while fully admitting he is not a JavaScript expert. And if I had known this -- well, I kind of knew this was always possible in my early days of Shiny, but I felt scared about it. So if you've ever been intimidated by the idea of custom JavaScript in your apps, you definitely need to read Albert's post here. This is a terrific example of one of the features of bs4Dash, where you have these little cards or boxes in your app and you let the user determine the order of them -- you just click and drag them around like you would anything else on a computer interface. But the issue was that the text inside these boxes was not being preserved in the new order the user created through this rearranging. How do we get that out? And that's where he took the moment to play around with a little bit of JavaScript in the developer console. To me, for all the 80s geeks out there, going into the JavaScript debugger console in, like, Chrome or your browser of interest is like my favorite movie, Tron, when Flynn goes into the game grid to hack the MCP. Yes, yes. There's a 68.71% chance you're right. But it's not so intimidating once you know what to look for, what element you need to get. Then the rest of the post goes through just a little bit of JavaScript code to grab the contents of that text input from that card, and then even do some more iterative programming to get those values in a map-like framework. And then the hook, of course, is to make that manipulation available to your Shiny app on the server side. So in essence, he's made a custom input that he can observe upon, or put in any reactive or other construct, to get the new ordering of the text that's available in those boxes. Obviously, this is a post you'll want to read probably a couple of times if you're new to this. But the way he outlines this, from the investigation, to honing in on the inputs needed, or the text inputs needed, and then bringing that back into Shiny, is a great use case for just how easy it is to get started with this. And it gives you that little seed of: I can take this much further in other situations where I don't have an R package, where I don't have a built-in Shiny function that will do this for me. It's a great way to know kind of the inside of how Shiny works. So again, great, great post by Albert here, and I really enjoyed reading it. Yes, Albert just keeps coming back with fantastic content, a lot of data viz, but this time it's around Shiny and JavaScript, which is just incredible. If anybody knows what kind of coffee he drinks, or what energy drink he likes to drink to be this productive and pump out this incredible RStats content, please let me know, because I am going to buy as much of it as I can. You covered a lot of, sort of, the problem statement of what Albert was trying to accomplish here with reordering these text area input boxes using the sortable function in bs4Dash, which is really, really nice and allows you to drag and drop different elements, but wanting to get out the user's input into those boxes in the order that the boxes were dragged around in, and this required some JavaScript.
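To give a flavor of the general pattern described here -- this is a made-up minimal example, not Albert's code -- a few lines of JavaScript can push a value to the server as a brand new Shiny input, which the server then observes like any other input.

```r
library(shiny)

ui <- fluidPage(
  textInput("note", "Type something"),
  actionButton("send", "Send to the server"),
  # A small JavaScript snippet: on click, read a value in the browser and hand
  # it to Shiny as a brand new input, input$from_js
  tags$script(HTML("
    document.addEventListener('click', function(e) {
      if (e.target.id === 'send') {
        var value = document.getElementById('note').value;
        Shiny.setInputValue('from_js', value, {priority: 'event'});
      }
    });
  "))
)

server <- function(input, output, session) {
  # The custom input behaves like any other reactive input on the server side
  observeEvent(input$from_js, {
    message("Received from JavaScript: ", input$from_js)
  })
}

shinyApp(ui, server)
```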
One thing that Albert highlights and shows is how to play around with your web browser's console by essentially right-clicking, choosing the developer or inspect option in your browser, and then navigating to that console area, where you actually have a blinking cursor that lets you enter a particular command and get a response from the browser. Really, really cool, and something that I have not done enough of, to be honest, so this was a great introductory post for me and for anyone else who is looking to maybe get their hands dirty with a little bit of JavaScript and get into the developer portal in their web browser -- a really nice introductory way that Albert went about explaining this in his blog post. He talks about a few different ways to incorporate JavaScript code in your Shiny app, and two of them involve the shinyjs package, which is, I believe, a Dean Attali special -- did I get that wrong? No, you are exactly right. Dean Attali has been one of the MVPs of my early Shiny career, and still to this day I use his packages in one way, shape, or form. Absolutely, I do as well. So there are a couple of ways to go about doing that. You can define your JavaScript code in a long text string and include it via text variables with the shinyjs package. You can read your JavaScript code from a particular file using the extendShinyjs() function. Or you can incorporate JavaScript code without shinyjs in a couple of different ways: if you have a particular button, you can set the onclick attribute of that button, or you can sneak the JavaScript code into your app by placing tags$script() into the UI, and you don't need shinyjs to do that. So there are a bunch of different ways to go about incorporating some JavaScript into your Shiny app. Typically, the way that I have used JavaScript in the past is really with visualization libraries like echarts4r or reactable -- they allow you to write a little bit of custom JavaScript to do something beyond what those packages offer in just their R functions, which is really nice. But I think going the next step will be a big deal for me here, to actually write some custom JavaScript outside of one of those packages and include it in my app, and what a really nice use case Albert had to showcase how to do this in a few different ways. Yeah, and if you are inspired by this exploration, like I certainly was, and you're wondering, hey, where can I go to get more ideas of what I can do? What's the potential here for me? We'll have linked in the show notes two excellent, freely available resources. We have David Granjon's Outstanding User Interfaces with Shiny book, which is available online for free, with great chapters on JavaScript interaction with Shiny apps. And then John Coene, who of course is the author of echarts4r -- shout out to John -- we have a great link to his JavaScript for R book. That's another great way to learn about the potential here. So certainly you can go down quite a few rabbit holes here, but I think they're more than worth it, especially when you get into situations like mine where it's not just an app I'm making for a handful of people -- it's going to be for executives, or for key leaders who want the best user experience they can get. So really awesome, awesome post, Albert. And yeah, as Mike said, whatever you're consuming, whatever's in your routine that lets you crank all this out, send it our way, please. We'd love to have it.
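As a rough companion sketch of those inclusion options -- a toy example rather than anything from Albert's post -- here's how a few of them can sit side by side in one small app.

```r
library(shiny)
library(shinyjs)

ui <- fluidPage(
  useShinyjs(),
  # Plain tags$script(): raw JavaScript dropped into the UI, no shinyjs needed
  tags$script(HTML("console.log('app loaded');")),
  # An onclick attribute set directly on a button, also without shinyjs
  actionButton("hello", "Say hi", onclick = "alert('hi from inline JavaScript');"),
  actionButton("go", "Run JS from the server")
)

server <- function(input, output, session) {
  observeEvent(input$go, {
    # shinyjs::runjs(): JavaScript kept in an R text string and run on demand
    runjs("document.title = 'Updated from runjs()';")
  })
  # shinyjs::extendShinyjs() (not wired up here) is the file-based option: it
  # reads JavaScript from a script file and exposes its functions as js$...
}

shinyApp(ui, server)
```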
Yes, please. It's awesome. And what else is awesome? Well, of course, it's the rest of the issue of R Weekly. There's a ton of awesome content that Jonathan has put together for us, and we'll mention our additional finds here. Now, of course, for me, in this episode you've heard a lot about the tidyverse, for good reason of course, but that's not the only verse in this episode. I want to give a well-deserved shout out to my fellow life sciences R enthusiasts who have spearheaded the pharmaverse, which is a true testament to the power of collaboration and open source, putting the power of R into generating clinical results and data processing. And what we have linked is a great post from the Posit blog, which is a summary of how the pharmaverse started, which includes some great times at the R/Pharma conference, and how it's just become a huge and important part of the full story of how we're getting together in our industry to make things easier, to cooperate together instead of trying to do our own versions of these clinical reporting or processing needs in silos. Certainly there are ways to go, but it's really exciting to see the pharmaverse take a real foothold in what we're doing in life sciences, and there's also a teaser for an upcoming hackathon for one of the key packages, called admiral, that's happening in January. We'll have a link to that in the supplements as well. Mike, what did you find? There's a great post called Setting Up and Exploring a Larger than Memory Arrow Table, and it is all about handling larger-than-memory data with Arrow and DuckDB, which are two of my new favorite tools for essentially handling anything related to ETL and large data, if you will. There's been a ton of buzz lately about DuckDB as well. I think Arrow has been around a little bit longer, so maybe some of the Arrow buzz has died off a little bit, but I think it's still a fantastic project that I use all the time. And we are now starting to see that there are intersections of both of these packages and both of these technologies that we can leverage to make our queries even faster and make our data prep even faster, which is really incredible. And the authors really do a great job of showcasing the different advantages of using Arrow, DuckDB, or both in your ETL pipelines. They run 100 trials of some simulation code and plot, in a really nice ggplot, the time that it took to run that code with different combinations of Arrow and DuckDB. So I highly recommend checking out this post if, like me, you are interested in the most cutting-edge data manipulation libraries and technologies that we have available today. Yeah. Arrow and that ecosystem are something I want to pay a lot more attention to in my upcoming efforts in 2023 for some really complicated and voluminous data processing, so I'm really excited to see that tutorial as well. Yeah. Excellent. Excellent find there.
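For a flavor of what that combination looks like in practice, here's a minimal sketch -- the dataset path and column names are placeholders, not taken from the post itself.

```r
library(arrow)
library(duckdb)
library(dplyr)

# Point arrow at a directory of Parquet files without loading them into memory;
# "data/parquet_files", amount, and year are all hypothetical
ds <- open_dataset("data/parquet_files")

ds |>
  to_duckdb() |>                          # hand the arrow dataset to DuckDB's engine
  filter(amount > 0) |>
  group_by(year) |>
  summarize(avg_amount = mean(amount)) |>
  collect()                               # bring only the small result back into R
```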
And for feedback for the show, I want to offer a little correction from last week, when we were talking about that great tutorial on using RSelenium for web scraping. I admit I was under the impression that the RSelenium package had found a new maintainer, because I had seen a grand release very recently in the fall. That's actually not the case. RSelenium is still looking for an active maintainer, and hopefully we will have somebody in the community step up and take the reins of that very important package. So I want to give a great thank you to Koen Hufkens on Mastodon for letting us know about that. So, yep, if you're interested in RSelenium and taking the reins of a very important package, certainly get in touch via the GitHub repository for the package. There's an issue there looking for an active maintainer. So again, thank you, Koen, for that feedback. We don't have any boosts this week, but if you're interested in sending a little love to the show, if you're getting any value back from listening to this, you can send value back to us in any way you like. But you can do that easily with a new podcast app, which you can find at newpodcastapps.com, and send us a little boost and have a little fun with us. As for where you can find us, well, we have R Weekly available on Mastodon at @rweekly@fosstodon.org. And also you can find me, still somewhat, on Twitter at @theRcast, and also on Mastodon at @rpodcast@podcastindex.social. And Mike, where can they find you? Yes, you can find me on Twitter, still hanging around, at @mike_ketchbrook, or you can find me on Mastodon at @mike_thomas@fosstodon.org. Thank you. Yes. One of these days it'll become natural for us to say this with repetition, at will, of course. But yeah, please get in touch with us, and again, we're always happy to hear feedback or corrections or suggestions. Nothing is off the table for us, who want to make this podcast the best for all of you out there listening. Well, it's been a lot of fun as always, Mike. And again, I really enjoy seeing the new hardware. You're a serious podcaster now. That's awesome to see. Absolutely. Absolutely. You know, it only took me 50 episodes or whatever it's been so far, but I'm looking forward to what's to come. You bet. Yep. We've got a fantastic year, I'm sure, coming up in 2023, but we still have to see how we finish up 2022. So that means we will be back with episode 104 of the R Weekly Highlights podcast next week.