# GitHub Crawler
GitHub Crawler is an application developed within the context of the SDK4ED project, as part of the Dependability Toolbox, to enable the automatic collection and static analysis of a large number of open-source software repositories from GitHub.

The GitHub Crawler is responsible for (i) downloading a large number of open-source repositories from GitHub based on a user's query, (ii) compiling the downloaded repositories, and (iii) analyzing the downloaded repositories with the CKJM Extended (link) and CCCC (link) static code analyzers, in order to compute popular software metrics. The GitHub Crawler is useful for researchers and practitioners who would like to generate benchmark repositories of open-source software applications for further analysis and processing, which are particularly important for conducting empirical studies. Currently, the GitHub Crawler can download applications written in any programming language; however, it can compute software metrics only for software repositories written in Java, C, and C++.

As can be seen in the figure above, initially, the user has to provide a set of input parameters.

It should be noted that the steps are sequential and optional. At the start of the analysis, the user can determine which steps will be executed by properly setting a specific parameter. The user can select to:

- Simply download software repositories that satisfy their query (skipping the compilation and analysis steps)
- Download software repositories and compile them (skipping the analysis step)
- Download, compile, and analyze the software repositories (i.e., perform a complete analysis)

## Usage of the GitHub Crawler
The GitHub Crawler can be used (indirectly) through the API that is provided by the Dependability Toolbox. For more information on how to use the Dependability Toolbox, please check its dedicated wiki page (link).

Apart from the Dependability Toolbox, we also provide the option to execute the GitHub Crawler directly through the terminal. The general structure of the command that should be executed is provided below:
```
java -jar githubCrawler.jar <analysis_type> <language> <sort> <max_num> <message>
```
In the table below, a description of the parameters is provided:

| Parameter | Description |
|-----------|-------------|
| analysis_type | The type of analysis that should be performed by the GitHub Crawler. <br>Accepted values: <br>"1": Download Only (default) <br>"2": Download and Compile <br>"3": Full analysis (Download, Compile, and Analyze) |
| language | The programming language of the software repositories that will be downloaded (e.g., Java). |
| sort | How the retrieved repositories should be sorted (e.g., by stars, by forks, etc.). |
| max_num | The maximum number of software repositories to be gathered, i.e., the point at which the process should terminate (default value: 100). |
| message | An additional search string used to narrow down the scope of the search. It can be any string deemed necessary by the user. |
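
For illustration, the commands below sketch how such invocations might look from the terminal. The concrete argument values (the sort keyword "stars", the repository limit of 50, and the search string "web framework") are assumptions chosen for this example based on the parameter descriptions above, not values prescribed by the tool.

```
# Assumed example: full analysis (download, compile, and analyze) of up to 50
# Java repositories, sorted by stars, matching the search string "web framework"
java -jar githubCrawler.jar 3 Java stars 50 "web framework"

# Assumed example: download only, using the default maximum of 100 repositories
java -jar githubCrawler.jar 1 Java stars 100 "web framework"
```

Depending on the chosen analysis type, the crawler would then either stop after downloading (type "1") or continue with compilation and metric computation (type "3").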