Sometimes collecting the data once isn't enough. You might be monitoring your state's monthly WARN notices, or, like many of us, tracking changes in daily COVID-19 cases. Getting that data may require some fancy scraper or just a simple curl command that downloads a csv, but either way, the task has to be done on a regular basis. That's where the challenge is, because who wants to get on their computer and save a csv to a folder on a Sunday?

At the Los Angeles Times Data and Graphics desk, we use GitHub Actions for almost all of our scrapers that run on a schedule. Some run a couple of times a day, while others may run once a week. Many of our scrapers feed and update applications like our Coronavirus Tracker, State Elections Money Tracker and Drought Tracker.

We started using GitHub Actions for our COVID-19 scrapers about two years ago. In hindsight, it was an ideal choice for a few reasons. For one, our code, from scrapers to cleaners to aggregators, all lived in GitHub already. Using GitHub Actions meant that we didn't have to upload our code to a different service every time we made changes. It's also free to use if your repository is public (and even includes a certain number of free minutes for private repos). And it keeps the history of your scrapes (Simon Willison calls this approach "git scraping"), allowing you to go back in your git history if you want to see how the data changed over time.

Simply put, GitHub Actions is GitHub's own continuous integration platform. If you've never used something like that before, think of it as renting a blank computer with an operating system of your choice (Ubuntu, macOS or Windows). In order to use this blank computer, you'll need to write a few instructions in YAML to define what libraries need to be installed, what script to run, and how often this needs to happen.

Below, we'll use a very simple example to show how you can get your own scraper ready on GitHub Actions. There's also an excellent tutorial for beginners on using GitHub for scrapers on Ben Welsh's site.

One of our daily tasks at the Data and Graphics department is updating the wildfire evacuation zones shown on our Wildfire Map. Most of our evacuation zones come from a statewide evacuation map hosted on a California Department of Technology site, which updates frequently. The scraper that keeps our map updated needs to do two things: download the GeoJSON, then filter to the zones we need. This requires two terminal commands (we use curl and mapshaper for these tasks), which can be run from a Makefile. It looks something like this:

    download:
    	curl "veryLongUrl" -o raw/zones.geojson

    filter:
    	mapshaper raw/zones.geojson -filter '"Evacuation Order, Evacuation Warning".indexOf(STATUS) > -1' -o processed/zones.geojson

The Makefile, along with two folders for GeoJSON files, is kept in a GitHub repository. These commands need to be run every day, and even multiple times a day during fire season. A typical workflow would require a reporter who's on shift to clone the repository, run the make commands on their local terminal and push the changes back up. By using GitHub Actions, we've allowed the reporters to complete the whole workflow with a single click of a button. And no one has to clone the repo unless it's for development purposes!
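For context, the routine a reporter on shift would otherwise have to run from their own terminal looks roughly like this (the repository URL below is a hypothetical placeholder, not our actual repo):

```sh
# the old, manual routine a reporter on shift would run locally
git clone https://github.com/example/evacuation-zones.git   # hypothetical URL
cd evacuation-zones
make download filter        # the two Makefile targets shown above
git add .
git commit -m "Update evacuation zones"
git push
```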
Creating a GitHub Actions workflow

GitHub has great documentation on how to get started with Actions, and we highly recommend taking a look if you are thinking of creating one. Here's an example of the process you can follow to move an existing project into GitHub Actions:

First, move your scripts to GitHub if you are not keeping them there already. For our example above, the repo consists of a single Makefile and some folders.

Then you'll need a workflow .yml file. You can click on the "Actions" tab of your repository on GitHub and commit a simple workflow from there; ours is called update.yml, and it's about 20 lines of code. If you have more than one workflow, each one is listed on the Actions tab.
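We won't walk through update.yml line by line, but a minimal sketch of a comparable workflow might look like this. The cron schedule, the Node and mapshaper setup, and the commit step are illustrative assumptions rather than a copy of our actual file:

```yaml
name: Update evacuation zones

on:
  workflow_dispatch:          # lets a reporter trigger a run with a button click
  schedule:
    - cron: "0 */6 * * *"     # and run on a schedule (every six hours, as an example)

permissions:
  contents: write             # allow the job to push the updated GeoJSON back

jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - uses: actions/setup-node@v3
        with:
          node-version: 18

      # curl ships with the runner; mapshaper is assumed to come from npm
      - run: npm install -g mapshaper

      # run the same Makefile targets a reporter would run locally
      - run: make download filter

      # commit the refreshed files back to the repo, but only if something changed
      - run: |
          git config user.name "github-actions"
          git config user.email "actions@github.com"
          git add -A
          git diff --cached --quiet || git commit -m "Update evacuation zones"
          git push
```

The workflow_dispatch trigger is what makes the "single click of a button" possible: it adds a "Run workflow" button to the Actions tab, while the schedule block takes care of the regular automated runs.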