Moz Blog |
| Using Kimono Labs to Scrape the Web for Free Posted: 11 Jun 2014 03:38 PM PDT Posted by CatalystSEM This post was originally in YouMoz, and was promoted to the main blog because it provides great value and interest to our community. The author's views are entirely his or her own and may not reflect the views of Moz, Inc. Historically, I have written and presented about big data—using data to create insights, and how to automate your data ingestion process by connecting to APIs and leveraging advanced database technologies. Recently I spoke at SMX West about leveraging the rich data in webmaster tools. After the panel, I was approached by the in-house SEO of a small company, who asked me how he could extract and leverage all the rich data out there without having a development team or large budget. I pointed him to the CSV exports and some of the more hidden tools to extract Google data, such as the GA Query Builder and the YouTube Analytics Query Builder. However, what do you do if there is no API? What do you do if you want to look at unstructured data, or use a data source that does not provide an export? For today's analytics pros, the world of scraping—or content extraction (sounds less black hat)—has evolved a lot, and there are lots of great technologies and tools out there to help solve those problems. To do so, many companies have emerged that specialize in programmatic content extraction such as Mozenda, ScraperWiki, ImprtIO, and Outwit, but for today's example I will use Kimono Labs. Kimono is simple and easy to use and offers very competitive pricing (including a very functional free version). I should also note that I have no connection to Kimono; it's simply the tool I used for this example. Before we get into the actual "scraping" I want to briefly discuss how these tools work. The purpose of a tool like Kimono is to take unstructured data (not organized or exportable) and convert it into a structured format. The prime example of this is any ranking tool. A ranking tool reads Google's results page, extracts the information and, based on certain rules, it creates a visual view of the data which is your ranking report. Kimono Labs allows you to extract this data either on demand or as a scheduled job. Once you've extracted the data, it then allows you to either download it via a file or extract it via their own API. This is where Kimono really shines—it basically allows you to take any website or data source and turn it into an API or automated export. For today's exercise I would like to create two scrapers. A. A ranking tool that will take Google's results and store them in a data set, just like any other ranking tool. (Disclaimer: this is meant only as an example, as scraping Google's results is against Google's Terms of Service). B. A ranking tool for Slideshare. We will simulate a Slideshare search and then extract all the results including some additional metrics. Once we have collected this data, we will look at the types of insights you are able to generate. 1. Sign upSignup is simple; just go to http://www.kimonolabs.com/signup and complete the form. You will then be brought to a welcome page where you will be asked to drag their bookmarklet into your bookmarks bar. The Kimonify Bookmarklet is the trigger that will start the application. 2. Building a ranking toolSimply navigate your browser to Google and perform a search; in this example I am going to use the term "scraping." Once the results pages are displayed, press the kimonify button (in some cases you might need to search again). Once you complete your search you should see a screen like the one below: It is basically the default results page, but on the top you should see the Kimono Tool Bar. Let's have a close look at that: The bar is broken down into a few actions:
After you press the bookmarklet you need to start tagging the individual elements you want to extract. You can do this simply by clicking on the desired elements on the page (if you hover over it, it changes color for collectable elements). Kimono will then try to identify similar elements on the page; it will highlight some suggested ones and you can confirm a suggestion via the little checkmark: A great way to make sure you have the correct elements is by looking at the count. For example, we know that Google shows 10 results per page, therefore we want to see "10" in the item count box, which indicates that we have 10 similar items marked. Now go ahead and name your new item group. Each collection of elements should have a unique name. In this page, it would be "Title". Now it's time to confirm the data; just click on the little Data icon to see a preview of the actual data this page would collect. In the data view you can switch between different formats (JSON, CSV and RSS). If everything went well, it should look like this: As you can see, it not only extracted the visual title but also the underlying link. Good job! To collect some more info, click on the Extractor icon again and pick out the next element. Now click on the Plus icon and then on the description of the first listing. Since the first listing contains site links, it is not clear to Kimono what the structure is, so we need to help it along and click on the next description as well. As soon as you do this, Kimono will identify some other descriptions; however, our count only shows 8 instead of the 10 items that are actually on that page. As we scroll down, we see some entries with author markup; Kimono is not sure if they are part of the set, so click the little checkbox to confirm. Your count should jump to 10. Once you've highlighted both names, results should look like the one below, with the count in the circle being 2 representing the two authors listed on this page. Out of interest I did the same for the number of people in their Google+ circles. Once you have done that, click on the Model View button, and you should see all the fields. If you click on the Data View you should see the data set with the authors and circles. As a final step, let's go back to the Extractor view and define the pagination; just click the Pagination button (it looks like a book) and select the next link. Once you have done that, click Done. You will be presented with a screen similar to this one: Here you simply name your API, define how often you want this data to be extracted and how many pages you want to crawl. All of these settings can be changed manually; I would leave it with On demand and 10 pages max to not overuse your credits. To collect the listings requires a quick setup. Click on the pagination tab, turn it on and set your schedule to On demand to pull data when you ask it to. Your screen should look like this: Now press Crawl and Kimono will start collecting your data. If you see any issues, you can always click on Edit API and go back to the extraction screen. 3. Extracting SlideShare dataWith Slideshare's recent growth in popularity it has become a document sharing tool of choice for many marketers. But what's really on Slideshare, who are the influencers, what makes it tick? We can utilize a custom scraper to extract that kind data from Slideshare. To get started, point your browser to Slideshare and pick a keyword to search for. For our example I want to look at presentations that talk about PPC in English, sorted by popularity, so the URL would be: Once you are on that page, pick the Kimonify button as you did earlier and tag the elements. In this case I will tag:
Once you have tagged those, go ahead and add the pagination as described above. That will make a nice rich dataset which should look like this: Hit Done and you're finished. In order to quickly highlight the benefits of this rich data, I am going to load the data into Spotfire to get some interesting statics (I hope). 4. InsightsRather than do a step-by-step walktrough of how to build dashboards, which you can find here, I just want to show you some insights you can glean from this data:
5. OutputOne of the great things about Kimono we have not really covered is that it actually converts websites into APIs. That means you build them once, and each time you need the data you can call it up. As an example, if I call up the Slideshare API again tomorrow, the data will be different. So you basically appified Slisdeshare. The interesting part here is the flexibility that Kimono offers. If you go to the How to Use slide, you will see the way Kimono treats the Source URL In this case it looks like this: The way you can pull data from Kimono aside from the export is their own API; in this case you call the default URL, You would get the default data from the original URL; however, as illustrated in the table above, you can dynamically adjust elements of the source URL. For example, if you append "&q=SEO" I know this was a lot of information, but believe me when I tell you, we just scratched the surface. Tools like Kimono offer a variety of advanced functions that really open up the possibilities. Once you start to realize the potential, you will come up with some amazing, innovative ideas. I would love to see some of them here shared in the comments. So get out there and start scraping … and please feel free to tweet at me or reply below with any questions or comments! Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don't have time to hunt down but want to read! |
| Your Google Algorithm Cheat Sheet: Panda, Penguin, and Hummingbird Posted: 10 Jun 2014 05:16 PM PDT Posted by MarieHaynes If you're reading the Moz blog, then you probably have a decent understanding of Google and its algorithm changes. However, there is probably a good percentage of the Moz audience that is still confused about the effects that Panda, Penguin, and Hummingbird can have on your site. I did write a post last year about the main differences between Penguin and a Manual Unnautral Links Penalty, and if you haven't read that, it'll give you a good primer. The point of this article is to explain very simply what each of these algorithms are meant to do. It is hopefully a good reference that you can point your clients to if you want to explain an algorithm change and not overwhelm them with technical details about 301s, canonicals, crawl errors, and other confusing SEO terminologies. What is an algorithm change?First of all, let's start by discussing the Google algorithm. It's immensely complicated and continues to get more complicated as Google tries its best to provide searchers with the information that they need. When search engines were first created, early search marketers were able to easily find ways to make the search engine think that their client's site was the one that should rank well. In some cases it was as simple as putting in some code on the website called a meta keywords tag. The meta keywords tag would tell search engines what the page was about. As Google evolved, its engineers, who were primarily focused on making the search engine results as relevant to users as possible, continued to work on ways to stop people from cheating, and looked at other ways to show the most relevant pages at the top of their searches. The algorithm now looks at hundreds of different factors. There are some that we know are significant such as having a good descriptive title (between the <title></title> tags in the code.) And there are many that are the subject of speculation such as whether or not Google +1's contribute to a site's rankings. In the past, the Google algorithm would change very infrequently. If your site was sitting at #1 for a certain keyword, it was guaranteed to stay there until the next update which might not happen for weeks or months. Then, they would push out another update and things would change. They would stay that way until the next update happened. If you're interested in reading about how Google used to push updates out of its index, you may find this Webmaster World forum thread from 2002 interesting. (Many thanks to Paul Macnamara for explaining to me how algo changes used to work on Google in the past and pointing me to the Webmaster World thread.) This all changed with launch of "Caffeine" in 2010. Since Caffeine launched, the search engine results have been changing several times a day rather than every few weeks. Google makes over 600 changes to its algorithm in a year, and the vast majority of these are not announced. But, when Google makes a really big change, they give it a name, usually make an announcement, and everyone in the SEO world goes crazy trying to figure out how to understand the changes and use them to their advantage. Three of the biggest changes that have happened in the last few years are the Panda algorithm, the Penguin algorithm and Hummingbird. What is the Panda algorithm? Panda first launched on February 23, 2011. It was a big deal. The purpose of Panda was to try to show high-quality sites higher in search results and demote sites that may be of lower quality. This algorithm change was unnamed when it first came out, and many of us called it the "Farmer" update as it seemed to affect content farms. (Content farms are sites that aggregate information from many sources, often stealing that information from other sites, in order to create large numbers of pages with the sole purpose of ranking well in Google for many different keywords.) However, it affected a very large number of sites. The algorithm change was eventually officially named after one of its creators, Navneet Panda. When Panda first happened, a lot of SEOs in forums thought that this algorithm was targeting sites with unnatural backlink patterns. However, it turns out that links are most likely not a part of the Panda algorithm. It is all about on-site quality. In most cases, sites that were affected by Panda were hit quite hard. But, I have also seen sites that have taken a slight loss on the date of a Panda update. Panda tends to be a site-wide issue which means that it doesn't just demote certain pages of your site in the search engine results, but instead, Google considers the entire site to be of lower quality. In some cases though Panda can affect just a section of a site such as a news blog or one particular subdomain. Whenever a Google employee is asked about what needs to be done to recover from Panda, they refer to a blog post by Google Employee Amit Singhal that gives a checklist that you can use on your site to determine if your site really is high quality or not. Here is the list:
Phew! That list is pretty overwhelming! These questions do not necessarily mean that Google tries to algorithmically figure out whether your articles are interesting or whether you have told both sides of a story. Rather, the questions are there because all of these factors can contribute to how real-life users would rate the quality of your site. No one really knows all of the factors that Google uses in determining the quality of your site through the eyes of Panda. Ultimately though, the focus is on creating the best site possible for your users. It is also important that only your best stuff is given to Google to have in its index. There are a few factors that are widely accepted as important things to look at in regards to Panda: Thin contentA "thin" page is a page that adds little or no value to someone who is reading it. It doesn't necessarily mean that a page has to be a certain number of words, but quite often, pages with very few words are not super-helpful. If you have a large number of pages on your site that contain just one or two sentences and those pages are all included in the Google index, then the Panda algorithm may determine that the majority of your indexed pages are of low quality. Having the odd thin page is not going to cause you to run in to Panda problems. But, if a big enough portion of your site contains pages that are not helpful to users, then that is not good. Duplicate contentThere are several ways that duplicate content can cause your site to be viewed as a low-quality site by the Panda algorithm. The first is when a site has a large amount of content that is copied from other sources on the web. Let's say that you have a blog on your site and you populate that blog with articles that are taken from other sources. Google is pretty good at figuring out that you are not the creator of this content. If the algorithm can see that a large portion of your site is made up of content that exists on other sites then this can cause Panda to look at you unfavorably. You can also run into problems with duplicated content on your own site. One example would be for a site that has a large number of products for sale. Perhaps each product has a separate page for each color variation and size. But, all of these pages are essentially the same. If one product comes in 20 different colors and each of those come in 6 different sizes, then that means that you have 120 pages for the same product, all of which are almost identical. Now, imagine that you sell 4,000 products. This means that you've got almost half a million pages in the Google index when really 4,000 pages would suffice. In this type of situation, the fix for this problem is to use something called a canonical tag. Moz has got a really good guide on using canonical tags here, and Dr. Pete has also written this great article on canonical tag use. Low-quality contentWhen I write an article and publish it on one of my websites, the only type of information that I want to present to Google is information that is the absolute best of its kind. In the past, many SEOs have given advice to site owners saying that it was important to blog every day and make sure that you are always adding content for Google to index. But, if what you are producing is not high quality content, then you could be doing more harm than good. A lot of Amit Singhal's questions listed above are asking whether the content on your site is valuable to readers. Let's say that I have an SEO blog and every day I take a short blurb from each of the interesting SEO articles that I have read online and publish it as a blog post on my site. Is Google going to want to show searchers my summary of these articles, or would they rather show them the actual articles? Of course my summary is not going to be as valuable as the real thing! Now, let's say that I have done this every day for 4 years. Now my site has over 4,000 pages that contain information that is not unique and not as valuable as other sites on the same topics. Here is another example. Let's say that I am a plumber. I've been told that I should blog regularly, so several times a week I write a 2-3 paragraph article on things like, "How to fix a leaky faucet" or "How to unclog a toilet." But, I'm busy and don't have much time to put into my website so each article I've written contains keywords in the title and a few times in the content, but the content is not in depth and is not that helpful to readers. If the majority of the pages on my site contain information that no one is engaging with, then this can be a sign of low quality in the eyes of the Panda algorithm. There are other factors that probably play a roll in the Panda algorithm. Glenn Gabe recently wrote an excellent article on his evaluation of sites affected by the most recent Panda update. His bullet point list of things to improve upon when affected by Panda is extremely thorough. How to recover from a Panda hitGoogle refreshes the Panda algorithm approximately monthly. They used to announce whenever they were refreshing the algorithm, but now they only do this if there is a really big change to the Panda algorithm. What happens when the Panda algorithm refreshes is that Google takes a new look at each site on the web and determines whether or not it looks like a quality site in regards to the criteria that the Panda algorithm looks at. If your site was adversely affected by Panda and you have made changes such as removing thin and duplicate content then, when Panda refreshes, you should see that things improve. However, for some sites it can take a couple of Panda refreshes to see the full extent of the improvements. This is because it can sometimes take several months for Google to revisit all of your pages and recognize the changes that you have made. Every now and then, instead of just refreshing the algorithm, Google does what they call an update. When an update happens, this means that Google has changed the criteria that they use to determine what is and isn't considered high quality. On May 20, 2014, Google did a major update which they called Panda 4.0. This caused a lot of sites to see significant changes in regards to Panda: Not all Panda recoveries are as dramatic as this one. But, if you have been affected by Panda and you work hard to make changes to your site, you really should see some improvement. What is the Penguin algorithm? The Penguin algorithm initially rolled out on April 24, 2012. The goal of Penguin is to reduce the trust that Google has in sites that have cheated by creating unnatural backlinks in order to gain an advantage in the Google results. While the primary focus of Penguin is on unnatural links, there can be other factors that can affect a site in the eyes of Penguin as well. Links, though, are known to be by far the most important thing to look at. Why are links important?A link is like a vote for your site. If a well respected site links to your site, then this is a recommendation for your site. If a small, unknown site links to you then this vote is not going to count for as much as a vote from an authoritative site. Still, if you can get a large number of these small votes, they really can make a difference. This is why, in the past, SEOs would try to get as many links as they could from any possible source. Another thing that is important in the Google algorithms is anchor text. Anchor text is the text that is underlined in a link. So, in this link to a great SEO blog, the anchor text would be "SEO blog." If Moz.com gets a number of sites linking to them using the anchor text "SEO blog," that is a hint to Google that people searching for "SEO blog" probably want to see sites like Moz in their search results. It's not hard to see how people could manipulate this part of the algorithm. Let's say that I am doing SEO for a landscaping company in Orlando. In the past, one of the ways that I could cheat the algorithm into thinking that my company should be ranked highly would be to create a bunch of self made links and use anchor text in these links that contain phrases like Orlando Landscaping Company, Landscapers in Orlando and Orlando Landscaping. While an authoritative link from a well respected site is good, what people discovered is that creating a large number of links from low quality sites was quite effective. As such, what SEOs would do is create links from easy to get places like directory listings, self made articles, and links in comments and forum posts. While we don't know exactly what factors the Penguin algorithm looks at, what we do know is that this type of low quality, self made link is what the algorithm is trying to detect. In my mind, the Penguin algorithm is sort of like Google putting a "trust factor" on your links. I used to tell people that Penguin could affect a site on a page or even a keyword level, but Google employee John Mueller has said several times now that Penguin is a sitewide algorithm. This means that if the Penguin algorithm determines that a large number of the links to your site are untrustworthy, then this reduces Google's trust in your entire site. As such, the whole site will see a reduction in rankings. While Penguin affected a lot of sites drastically, I have seen many sites that saw a small reduction in rankings. The difference, of course, depends on the amount of link manipulation that has been done. How to recover from a Penguin hit?Penguin is a filter just like Panda. What that means, is that the algorithm is re-run periodically and sites are re-evaluated with each re-run. At this point it is not run very often at all. The last update was October 4, 2013 which means that we have currently been waiting eight months for a new Penguin update. In order to recover from Penguin, you need to identify the unnatural links pointing to your site and either remove them, or if you can't remove them you can ask Google to no longer count them by using the disavow tool. Then, the next time that Penguin refreshes or updates, if you have done a good enough job at cleaning up your unnatural links, you will once again regain trust in Google's eyes. In some cases, it can take a couple of refreshes in order for a site to completely escape Penguin because it can take up to 6 months for all of a site's disavow file to be completely processed. If you are not certain how to identify which links to your site are unnatural, here are some good resources for you:
The disavow tool is something that you probably should only be using if you really understand how it works. It is potentially possible for you to do more harm than good to your site if you disavow the wrong links. Here is some information on using the disavow tool:
It's important to note that when sites "recover" from Penguin, they often don't skyrocket up to top rankings once again as those previously high rankings were probably based on the power of links that are now considered unnatural. Here is some information on what to expect when you have recovered from a link based penalty or algorithmic issue. Also, the Penguin algorithm is not the same thing as a manual unnatural links penalty. You do not need to file a reconsideration request to recover from Penguin. You also do not need to document the work that you have done in order to get links removed as no Google employee will be manually reviewing your work. As mentioned previously, here is more information on the difference between the Penguin algorithm and a manual unnatural links penalty. What is Hummingbird? Hummingbird is a completely different animal than Penguin or Panda. (Yeah, I know...that was a bad pun.) I will commonly get people emailing me telling me that Hummingbird destroyed their rankings. I would say that in almost every case that I have evalutated, this was not true. Google made their announcement about Hummingbird on September 26, 2013. However, at that time, they announced that Hummingbird had already been live for about a month. If the Hummingbird algorithm was truly responsible for catastrophic ranking fluctuations then we really should have seen an outcry from the SEO world of something drastic happening in August of 2013, and this did not happen. There did seem to be some type of fluctuation that happened around August 21 as reported here on Search Engine Round Table, but there were not many sites that reported huge ranking changes on that day. If you think that Hummingbird affected you, it's not a bad idea to look at your traffic to see if you noticed a drop on October 4, 2013 which was actually a refresh of the Penguin algorithm. I believe that a lot of people who thought that they were affected by Hummingbird were actually affected by Penguin which happened just a week after Google made their announcement about Hummingbird. There are some excellent articles on Hummingbird here and here. Hummingbird was a complete overhaul of the entire Google algorithm. As Danny Sullivan put it, if you consider the Google algorithm as an engine, Panda and Penguin are algorithm changes that were like putting a new part in the engine such as a filter or a fuel pump. But, Hummingbird wasn't just a new part; it was a completely new engine. That new engine still makes use of many of the old parts (such as Panda and Penguin) but a good amount of the engine is completely original. The goal of the Hummingbird algorithm is for Google to better understand a user's query. Bill Slawski who writes about Google patents has a great example of this in his post here. He explains that when someone searches for "What is the best place to find and eat Chicago deep dish style pizza?", Hummingbird is able to discern that by "place" the user likely would be interested in results that show "restaurants". There is speculation that these changes were necessary in order for Google's voice search to be more effective. When we're typing a search query, we might type, "best Seattle SEO company" but when we're speaking a query (i.e. via Google Glass or via Google Now) we're more likely to say something like, "Which firm in Seattle offers the best SEO services?" The point of Hummingbird is to better understand what users mean when they have queries like this. So how do I recover or improve in the eyes of Hummingbird?If you read the posts referenced above, the answer to this question is essentially to create content that answers users queries rather than just trying to rank for a particular keyword. But really, this is what you should already be doing! It appears that Google's goal with all of these algorithm changes (Panda, Penguin and Hummingbird) is to encourage webmasters to publish content that is the best of its kind. Google's goal is to deliver answers to people who are searching. If you can produce content that answers people's questions, then you're on the right track. I know that that is a really vague answer when it comes to "recovering" from Hummingbird. Hummingbird really is different than Panda and Penguin. When a site has been demoted by the Panda or Penguin algorithm, it's because Google has lost some trust in the site's quality, whether it is on-site quality or the legitimacy of its backlinks. If you fix those quality issues you can regain the algorithm's trust and subsequently see improvements. But, if your site seems to be doing poorly since the launch of Hummingbird, then there really isn't a way to recover those keyword rankings that you once held. You can, however, get new traffic by finding ways to be more thorough and complete in what your website offers. Do you have more questions?My goal in writing this article was to have a resource to point people to when they had basic questions about Panda, Penguin and Hummingbird. Recently, when I published my penalty newsletter, I had a small business owner comment that it was very interesting but that most of it went over their head. I realized that many people outside of the SEO world are greatly affected by these algorithm changes, but don't have much information on why they have affected their website. Do you have more questions about Panda, Penguin or Hummingbird? If so, I'd be happy to address them in the comments. I also would love for those of you who are experienced with dealing with websites affected by these issues to comment as well. Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don't have time to hunt down but want to read! |
| You are subscribed to email updates from Moz Blog To stop receiving these emails, you may unsubscribe now. | Email delivery powered by Google |
| Google Inc., 20 West Kinzie, Chicago IL USA 60610 | |
No comments:
Post a Comment