Hello friends, welcome to episode 102 of the R Weekly Highlights podcast. My name is Eric Nantz, and while the end of the year is rapidly approaching, we're not closing up our awesome looks at the R Weekly highlights for the week. And you know I can never, ever do this without my awesome co-host, Mike Thomas. Mike, how have you been today?

I've been great, Eric, trying to balance working on one screen while watching the World Cup on the other screen, and so far it's going pretty well.

Yes, I was watching some World Cup action at my kid's practice yesterday, the USA game by the time you're listening to this, and a lot of people were glued to the TVs on that one while kids randomly skated in figure eights around the ice. An interesting atmosphere nonetheless. Good stuff. And if you're enjoying the World Cup out there, or you're more into the R content, we have a little bit of everything today, as we always do. Our issue this week has been curated by Tony ElHabr, one of my great colleagues on the R Weekly team whom I met in person at rstudio::conf earlier this year, with lots of fun conversations there. As always, thank you to Tony, to all of you around the world for your pull requests and contributions to R Weekly, and to our fellow curators. So let's dive right into it.

I've always felt spoiled, when thinking about developing packages, just by the existence of reliable and time-saving packages such as usethis and devtools that eliminate so much of the manual effort and pain points I had when I created an R package from scratch. Now, even with those powerful helpers, we've been seeing some trends over the last year, especially as we've been doing this podcast: more examples of packages in the R ecosystem that layer a little extra help on top and take this experience to another level, giving you fewer excuses to put off that bit of development you may have been dreading. Case in point: tests. Yes, I said it, tests. In my past experience, sometimes the thing I wait until the very end to bolt onto my packages, I admit, I'm being honest here, and I'm trying to be more proactive about it. Now, of course, we're very familiar with testthat; that's certainly not going away. But there's another interesting aspect to how you can get those tests generated while also following the good practice of documenting examples of how your functions work in a package. And that's where our first highlight comes in, with version 0.1 of the doctest package, just released by David Hugh-Jones, an associate professor in the School of Economics at the University of East Anglia. What's interesting is that if you've been following best practices with devtools and the like, you've likely been writing your function documentation with roxygen2 tags, where you put the little @param and the description of the parameter and all that. Well, doctest gives you additional roxygen tags to start building in your test cases. That seems kind of mind-blowing to me, but it actually works. I don't know, are you as amazed about this as I am?

There's a lot of magic at play here, and I was absolutely blown away by this package. I guess I didn't realize that other languages like Python and Rust already have this built in. But as you said, doctest enables you to write code and tags in your roxygen documentation that automatically generate unit tests.
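To make that concrete, here is a minimal sketch of what those tags look like, adapted from the style of the examples in the doctest README. The add() function here is made up for illustration, and the exact tag set may differ between versions, so check the package documentation before relying on this:

```r
#' Add two numbers
#'
#' @param x,y Numbers to add.
#' @return The sum of `x` and `y`.
#'
#' @doctest
#' @expect equal(4)
#' add(2, 2)
#'
#' @export
add <- function(x, y) x + y
```

When the doctest roclet runs during documentation, tags like these are turned into a generated testthat file, roughly equivalent to expect_equal(add(2, 2), 4), so the documented example and the unit test come from the same source.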
It's incredible. The doctest package isn't on CRAN yet, so install the dev version from GitHub and give it a whirl, but the README on GitHub, as well as the vignettes on the pkgdown site, have some great examples that make it pretty clear how exactly you would do this. And it's pretty straightforward: the tags you include look a lot like the tests you would write with testthat. I see myself using doctest as a huge efficiency-gain tool for functions that have a couple of simple tests, perhaps. I'm not sure I'm entirely sold on using it for functions that have more, and especially more complex, testing logic, where I'm doing some sort of random sampling and checking distributions or things like that in my tests. It seems like that could be difficult to do entirely with just these roxygen tags. But if I'm not mistaken, I could take kind of a hybrid approach: use doctest only on the functions where I choose to, while going the traditional or manual testing route on other functions, skipping the @doctest tag and writing up the full testing logic in my own testthat file. So, pretty incredible. Nice to see this covered in the highlights: version 0.1 with 98% code coverage, and for a testing-related package it's great to see such good code coverage. I'm very excited to watch this and see where it goes, and I will absolutely be one of the users pulling down the dev version from GitHub, playing around with it, and seeing how much efficiency I can gain. And I'd definitely encourage our listeners to do the same.

Yeah, and one great point you made is that this is not an all-or-nothing thing, right? I view it similarly to you: my package will likely have some utility functions that are more nimble, a little smaller in scope, where the tests are going to be written pretty concisely, and that's where doctest can be a huge help. But if you have a more sophisticated function, maybe one doing an API call or anything like that, you can still opt to do it the quote-unquote old-fashioned way with testthat. You're not confined to all of one approach or all of the other, and that's a microcosm of the way a lot of these package helpers work: you can take the bits you want. I don't have to use the wrappers from usethis and the like to generate data for my package unless I want to. And I think there will be cases where you get overwhelmed pretty quickly with more sophisticated code around your test cases. But again, having another easier entry point to a best practice really helps you as a developer, especially one new to package development, make that leap a little more easily. Then, as you get more experienced, you can choose to stick with it, go full-bore testthat, or do the hybrid like I'm talking about. So for my next internal package, I'm going to take a hybrid approach with doctest as it matures and hopefully gets a CRAN or R-universe release in the future, and then I'll start using it in bigger projects. It's certainly a great time to play with it and a great time to give feedback, as always. That's of course the virtue of open source, having all of this available at our disposal for lots of use cases. So really well done; hopefully this will get a good reception, and I'm certainly happy to see it. Absolutely.
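For comparison, the quote-unquote old-fashioned route is a standalone testthat file. A minimal sketch, reusing the hypothetical add() function from the earlier example:

```r
# tests/testthat/test-add.R
# The traditional route: the full testing logic lives in its own file,
# which leaves room for more complex setup (random sampling, fixtures, etc.)
test_that("add() returns the sum of its inputs", {
  expect_equal(add(2, 2), 4)
  expect_gt(add(1, 1), 0)
})
```

The hybrid approach Mike describes is simply using the roxygen tags for functions like this one and files like this for anything more involved.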
There are a couple of nuances I'll throw out there: if you want to use doctest in your package, you should alter your package's DESCRIPTION file to add doctest's dt_roclet to the Roxygen field, as well as add doctest to Suggests as a dependency. That's what's recommended here, so as you get started with this, keep that in mind as well.

Very good. And it always seems like magic, the things you can put in that DESCRIPTION file to instruct how a package imports these resources. That's something I'm going to try to learn more about next year, because I've seen so many approaches with roxygen2 in general, and there are other tools with fields you can put in there too. So that's another rabbit hole I'm going down one way or another next year, for sure.

Yes, absolutely. It reminds me of the attachment package that golem uses to amend the DESCRIPTION file with the dependencies it reads by parsing the code within your package. That saves me an outrageous amount of time.

Oh yeah, you can never quantify the awesome savings you get from taking away a lot of this manual effort. And another way you might be able to take away manual effort: as we've mentioned many times on this podcast, it's 2022, and the data you want is not always going to be in a CSV file. We're seeing a rich amount of data available on the web, and one way to get it is some good old-fashioned scraping. Scraping is the art of using programming, tools, or a combo of both to take what's being served on a website, maybe a table of data, maybe a list of things, massage it a little, import it into your favorite language, and apply your data science skills to it. And that's where our next highlight comes in. You might have a case where the data is not being served in a quote-unquote static way. Think of a page on Wikipedia, for example, that has a little table of, I don't know, country flags or whatever have you; you can use traditional packages like httr or rvest to interrogate that, and if the site has an API, that's definitely a great approach too. But you don't always get that. Sometimes those pages hide, behind that glitzy little table they serve up, some JavaScript magic that pulls the data in. That's where you have to go a little outside the box, and that's where a package called RSelenium comes in. I've had a lot of experience with this, so I'll share my thoughts shortly. But what we're talking about here is a great slide deck made by Etienne Bacher, hopefully I'm saying that right, a PhD student at LISER in Luxembourg, with a really comprehensive tutorial on not just what RSelenium is, but why you would want to use it, and maybe a couple of pitfalls along the way. So maybe you could dive through this, Mike: what did you learn from Etienne's presentation?

This is a really holistic piece covering just about every aspect of web scraping in R that I can think of. It's actually not a blog post, it's a slide deck that appears to be made with Quarto, and it's beautiful.

There you go, Quarto all the time now.

Yes. So, I always appreciate any sort of educational content that doesn't assume you already have everything installed.
And Etienne calls out the fact that you may have installation issues to get past before you can get going with RSelenium, and he does a really nice job of explaining the different workarounds you can try to troubleshoot any issues you run into, like ensuring you have Java installed on your machine and ensuring you have administrative access to Firefox, if that's the browser you decide to use for automated scraping. Like you said, Eric, if the site is static, rvest might be all you need in terms of packages. But if the site is dynamic, where it has those JavaScript bells and whistles, you are likely going to need RSelenium to interact with the page. And you can do just about anything with RSelenium that you could do manually: scrolling down a page, going back or forward on a site, clicking on a button or a link, editing a text box and adding text, or navigating to and interacting with a particular widget on the page. So it is pretty incredible. If you've never done any web scraping with RSelenium before, it does feel like magic that you are programmatically interacting with a web page. The slide deck is loaded with excellent examples and use cases for scraping both static and dynamic sites. There are even links to additional blog posts on web scraping with RSelenium, particularly some around parallel scraping, utilizing multiple browsers simultaneously, something I've never tried before, but I do have some long-running scraping jobs that might be interesting use cases for it. And I will shout out that one of those links is to Ivan Millanes' 2020 blog post on web scraping with RSelenium and rvest. Ivan is a very good friend of mine and a brilliant R developer, and we will link to that blog post as well in the show notes. It's really nice to see Etienne not only including a ton of fantastic content that he developed himself, but also referencing other folks in this space who have written educational content on web scraping with R. So it's definitely a resource that belongs in the Big Book of R, or wherever you store important R resources that you come across, beyond just your Mastodon or Twitter favorites button.

You bet. And on RSelenium itself: fairly recently, maybe a year or so ago, it was folded into rOpenSci as well, so it's being actively maintained. There was a period where it was a bit stagnant, but now it looks like active development is back. And I'll embellish a little on what I teased earlier. I definitely explored Selenium quite a bit as a precursor to what we now know as shinytest, as a way to test the Shiny apps and other web things I was creating. I admit getting Selenium working in an enterprise environment is painful, at least it was for me. One thing that can help, though: you may remember I've been harping on a certain technology that can abstract away a lot of these niche pain points. You can put Selenium in a Docker container, folks, and not have to deal with the Java nightmares I dealt with years ago.
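As a rough sketch of that container route, ahead of the tutorial link below: the selenium/standalone-firefox image and port mapping follow the common pattern for this setup, and the URL and CSS selector here are placeholders, so adapt everything to your target site.

```r
# Start a Selenium server in a container first (in a shell):
#   docker run -d -p 4445:4444 selenium/standalone-firefox

library(RSelenium)
library(rvest)

# Connect to the containerized Selenium server instead of a local install
remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port             = 4445L,
  browserName      = "firefox"
)
remDr$open()

# Interact with a dynamic page the way a human would
remDr$navigate("https://www.example.com")            # placeholder URL
btn <- remDr$findElement(using = "css selector",
                         value = "button.load-more") # hypothetical selector
btn$clickElement()

# Hand the rendered HTML off to rvest for the actual parsing
page <- read_html(remDr$getPageSource()[[1]])

remDr$close()
```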
So in addition to what you asked to link in the show notes, Mike, I'm also going to link a great tutorial from Callum Taylor on putting Selenium in Docker and then pointing RSelenium at that instead of a local install. That could be a big help to those of you who have shared my pain in getting Java things working, especially on Windows machines. Not that I know anything about that. But you can use containers to your advantage, at least that was my case, and it made Selenium a lot more accessible for me.

I'm going to have to read that post as well, because my weekly GitHub Action that had been running Selenium flawlessly for the last six to nine months broke this week, thanks to a GitHub Actions update of Node.js from version 12 to version 16, and I am struggling to figure out how to fix it, if you can't tell. But I'll leave it at that.

There are so many rabbit holes you can go down with GitHub Actions failures, especially when dependencies are updated without you knowing it. And you know me, I like to have these mental checklists of things to build up before I launch these efforts. Well, our last highlight is going to give us some checklists to think about. We have a little visit to our visualization corner, as we do on most of these episodes, and in particular a very regular contributor these days, Albert Rapp, is back at it again, this time with a great blog post he's put together on what you might call checklist items to keep in mind as you're creating that next fancy bar chart summarizing your great data insight. This is a very practical post. He starts the bar chart from the vanilla ggplot2 defaults and then gradually builds in both things that make the result easier to interpret and ways to present it more cleanly. So which items on this checklist resonated with you the most, Mike?

Yeah, well, first things first, this looks like another great Quarto blog. Shout out Quarto, if I haven't done that enough times yet today; for everyone counting out there, that's another Quarto shout-out for you. Fantastic blog post, again a great walkthrough by Albert Rapp of some awesome data viz content. Horizontal bars are greater than vertical bars. I said what I said. That's especially important when you have more than, I'd say, five categories, and I often have more than five categories: invert that axis, flip it, and let's see those categories on the y-axis, please. Data viz is all about communicating the data to the audience in the most efficient and effective way possible, and Albert highlights some great tips for doing that for bar plots specifically, including bar ordering and taking what I would call a minimalistic approach, dropping your axis labels in favor of data labels within each bar. The blog post concludes with a pretty interesting bar plot where the labels for both the x and y values sit above each individual bar, something you're probably going to have to see for yourself, so find this blog post in the issue, because it's tricky to articulate on a podcast. As always, Albert includes all of the fantastic code that produces each of the charts he shows as steps along the way toward the final chart. And I'll note that this is all contained to the tidyverse.
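Not Albert's actual code, but a minimal sketch of a few of those checklist items, ordered horizontal bars with direct data labels in place of an axis, on made-up data:

```r
library(ggplot2)
library(dplyr)
library(forcats)

# Made-up category counts, purely for illustration
dat <- tibble::tibble(
  category = c("Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot"),
  n        = c(42, 17, 8, 29, 35, 12)
)

dat |>
  mutate(category = fct_reorder(category, n)) |>  # order bars by value
  ggplot(aes(x = n, y = category)) +              # horizontal bars
  geom_col() +
  geom_text(aes(label = n), hjust = -0.25) +      # data labels on the bars
  scale_x_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(x = NULL, y = NULL) +                      # minimalist: drop axis titles
  theme_minimal()
```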
There are no additional plotting packages beyond ggplot2 at play here, and that's just another example of how far ggplot2 can get us in creating beautiful data visualizations. So kudos to Albert for continuing to pump out fantastic data viz and R content for us; this is a really nicely done blog post.

Yeah, I really enjoyed reading it, and I'm happy to say that, even in an unlikely situation, I followed at least some of this advice when I created a completely over-the-top Shiny dashboard last year to visualize virtual racing league results. It started as a bar chart and ended up being an animated chart of little racing cars. But it was horizontal; I got that one right.

Yes, appreciate that. The cars weren't driving against gravity.

Not quite, although they were going backwards. That's another story for another day. It was fun to build, though. But again, with respect to this post, everything you said, Mike, is achievable with no hacks and no additional packages beyond ggplot2. It's a great companion to the advice we also see from other thought leaders like Cédric Scherer and his many ggplot2 resources. I think what Albert gives here is probably what you'll start with as you get into the exercise of, okay, I know how to make a chart, but how do I start adding these nice touches to make interpretation easier, to make presentation easier? Because you never know where these charts end up; maybe this one is going to an executive somewhere, so you've got to make sure they can quickly see the main point of the plot. The little touches do add up, folks, they really do. Even in an app I made for work last year, where I was turning ggplot2 charts of clinical results into an eventual PowerPoint presentation, I would let the user opt into some of these little checklist items. Having awareness of these items and being able to mix and match them to your needs is great flexibility that ggplot2 offers, and a great job by Albert summarizing it all for us.

Yeah, a really excellent blog post here, as always. And I don't know how he cranks all this stuff out. If he's got some magic pill for time management, I need to take it, because I'm not able to crank out nearly the amount of stuff that he does. It's really, really well done.

You're not kidding. Yep. And what else we're not kidding about: another really awesome issue of R Weekly. That's a broken record at this point, but that's why we do this podcast, right? There are really great insights here, and I had a little find as I was combing through, one that's not in the issue but that I want to call out as well, and then I'll get to what is in the issue. Back in the good old days of graduate school, I would have these tests, right? I had to memorize certain concepts, whether in stats or computer science and the like. Did you know that now in R you can create flashcards? Yes you can, with the flashr package created by Jeffrey Stevens. I was poking around with this a little, and it's actually built on top of reveal.js, so under the hood you're getting slides, but it's a great way to test your knowledge. A nice little thing you might want to try next time you're preparing for that really hard exam, if you're out there doing schoolwork. I feel you; I remember graduate school both fondly and not so fondly, especially those exams. So if these help your studying, why not?
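Basic usage appears to be along these lines, with flashcard() building a reveal.js deck from a built-in term list; treat this as a sketch from a quick skim of the package and check the flashr documentation for the available decks and options.

```r
# install.packages("flashr")  # or install the dev version from GitHub
library(flashr)

# Build a reveal.js flashcard deck from a built-in term list.
# "data_types" is the deck name shown in the package's examples.
flashcard("data_types")
```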
That's absolutely something I would do: instead of just writing out ten flashcards in 15 minutes, I would program flashcards using the flashr package for like eight hours, trying to figure out how to get these beautiful flashcards to work, and maybe learn some of the content along the way.

Yeah, that's never the fun part, right? It's getting there. And the other thing I want to quickly mention before I turn it back over to Mike is that our friends at Jumping Rivers have released videos from their recent Shiny in Production conference, and there are about seven out there at the time of our recording. I highly recommend checking those out if you want to see how others in the enterprise are using Shiny in very important situations, with some great knowledge shared by Colin Fay and others in the community on widget building and comparing different frameworks for web apps. Lots of good content there. But yeah, Mike, what did you find this week?

Yes, so R Weekly's own Tony ElHabr created a fantastic gist on GitHub using the httr package and some tidyverse packages to leverage FIFA's API, showing you how to download player-specific or team-specific statistics from the World Cup. As a soccer nerd and a data nerd, this was an exciting gist to come across, and I thought it was kind of an interesting, unique entry in R Weekly this week, where you can get your hands dirty with some code laid out right in front of you.

Very timely, too; we were talking about scraping earlier, so another great way to get your scraping hats on. And luckily for this one, you don't have to worry about custom JavaScript serving those results, so your life's a little easier. I wish all these sites would make it this easy to access, but hey, at least this one does. So a great gist, as always. I also want to give a quick plug as we get to our feedback segment. We don't have any new feedback per se, but our previous contributor Rasta Calavera has been boosting the show again, automatically this time. You might be wondering how the heck that's pulled off. That's from one of these new podcast apps called Fountain, where they give you a little budget of sats to play with, and just by listening to a show, you can contribute a little bit every few minutes or so. So thank you, Rasta, for the extra, I believe, 100 sats from that automatic contribution. However you want to contribute, all you have to do is find one of those fancy new podcast apps at newpodcastapps.com, and we'll be glad to hopefully hear from you in the future. We also want to hear from you through contributions to R Weekly itself. You know how you can get your content on there? You go to rweekly.org, first of all; bookmark it today so you can get access to the upcoming draft, and send us a little pull request with any great blog post, new package, or tutorial out there, whatever have you, any great novel use of data science in R. We're always happy to put that into the next issue, so we're always just a pull request away, folks. And if you want to get in touch with us at R Weekly, we also have a new Mastodon account at @rweekly@fosstodon.org. You can also get in touch with each of us, your friendly hosts, individually. I am still somewhat on Twitter at @theRcast, but I am on Mastodon as well at @rpodcast@podcastindex.social. And Mike, where can they find you?
I'm still hanging around on Twitter for the time being as well, at @mike_ketchbrook. And I am on Mastodon as, let me check my profile, @mike_thomas@fosstodon.org.

Yeah, you can tell we're still getting used to calling these out, but it's going to become second nature sooner or later. We've already been hearing from some of you on these new avenues, so thank you for the shout-outs, and certainly stay tuned for more updates in the future. Speaking of the future, we're going to have to close up episode 102, but we will be back with episode 103 of the R Weekly Highlights podcast next week.