import os
os.chdir("/Users/ravikalia/Code/github.com/ml-blog/posts/git-search/")
Ravi Kalia
April 22, 2024
There are many ways to search a repository, particularly a Git Repo. We will outline some use cases with examples for a “Unix-like” file directory and also a Git Repo.
Let’s do this for a library I’ve been looking at.
Some unit tests break on Apple Silicon for the open source library pyg, and the maintainer disabled some of them. I want to find them. The problem has something to do with PyTorch not fully supporting compressed sparse tensor representations on Apple’s mps framework for Apple Silicon. I received the following note:
“there are a few tests that were disabled around test_sparse and the convert_coo_to_csr_indices doesn’t seem to be supported.”
It’s likely that test_sparse and convert_coo_to_csr_indices are variable names or tokens inside a code file of the git repository. However, for illustration, we will assume that they could be anywhere in the git repo (filenames, directory names, commit messages, variable names, past commits, current directory).
The objective is to find (and later fix) bugs related to these strings. So we need to find the strings in filenames and/or inside file contents (with line numbers in specific files), and to check the commit history for occurrences of the strings.
Time to search. Clone the repo locally and then cd into it.
Cloning into 'pytorch_geometric'...
Let’s change the directory to the root of the cloned repo, which makes searching easier.
We can look for the string test_sparse in filenames using the shell command line tool find.
./test/utils/test_sparse.py
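The find invocation itself is not shown here. A minimal sketch that should give the same kind of result, run on a throwaway directory with hypothetical file names rather than the real repo:

```shell
# Scratch tree that mimics a small part of the repo layout (hypothetical files).
demo=$(mktemp -d)
mkdir -p "$demo/test/utils" "$demo/torch_geometric/utils"
touch "$demo/test/utils/test_sparse.py" "$demo/torch_geometric/utils/sparse.py"
cd "$demo"

# -name compares each base name against a shell glob; the quotes stop the
# shell from expanding the pattern before find sees it.
found=$(find . -name "*test_sparse*")
echo "$found"
```

In the cloned repo only the last command is needed, run from the repo root.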
Great, so we have a file to look at: test_sparse.py. It seems to contain unit tests related to sparsity, possibly testing utility functions for converting between sparse tensor representations.
Searching for a string inside file contents is a bit more complicated. grep is an awesome tool for this.
./test/utils/test_cross_entropy.py:9:def test_sparse_cross_entropy_multiclass(with_edge_label_weight):
./test/utils/test_cross_entropy.py:32:def test_sparse_cross_entropy_multilabel(with_edge_label_weight):
./test/test_edge_index.py:102:def test_sparse_tensor(dtype, device):
./test/test_edge_index.py:992:def test_sparse_narrow(device):
./test/test_edge_index.py:1026:def test_sparse_resize(device):
./torch_geometric/testing/asserts.py:24: test_sparse_layouts: Optional[List[Union[str, torch.layout]]] = None,
./torch_geometric/testing/asserts.py:49: test_sparse_layouts (List[str or int], optional): The sparse layouts to
./torch_geometric/testing/asserts.py:62: if test_sparse_layouts is None:
./torch_geometric/testing/asserts.py:63: test_sparse_layouts = SPARSE_LAYOUTS
./torch_geometric/testing/asserts.py:74: if len(test_sparse_layouts) > 0 and sparse_size is None:
./torch_geometric/testing/asserts.py:75: raise ValueError(f"Got sparse layouts {test_sparse_layouts}, but no "
./torch_geometric/testing/asserts.py:93: for layout in (test_sparse_layouts or []):
Binary file ./.git/index matches
Many lines matched, spread across 3 files. It’s possible they aren’t all relevant for testing purposes. The .git/index file is a binary file that git uses to store information about the repository; it’s not relevant for our task.
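A recursive grep along these lines likely produced the listing above; the exact flags are an assumption. The sketch runs on a small scratch tree with made-up file contents:

```shell
# Scratch tree with one file that contains the string and one that does not.
demo=$(mktemp -d)
mkdir -p "$demo/test"
printf 'import torch\n\ndef test_sparse_narrow(device):\n    pass\n' > "$demo/test/test_edge_index.py"
printf 'def test_dense():\n    pass\n' > "$demo/test/test_dense.py"
cd "$demo"

# -r recurses into subdirectories, -n prefixes each match with its line number.
matches=$(grep -rn "test_sparse" .)
echo "$matches"
```

When run from a real repo root, adding --exclude-dir=.git should keep grep out of git's internal files and drop the binary .git/index match.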
Git is a distributed version control system. It is a tool that tracks changes in files and directories. At user-defined snapshots in time, called commits, it records the changes made to the files and directories. As a consequence it is possible to search for changes in the repository across snapshots.
Along with grep and find, there are git-specific tools for searching snapshots of the repo and commit messages, and for filtering by date and author, such as:
git ls-files
git log
git grep
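As a rough illustration of the date and author filters, git log accepts --author and --since (or --until). The sketch below builds a scratch repo with two made-up authors and fixed commit dates, then filters on each:

```shell
demo=$(mktemp -d) && cd "$demo" && git init -q

# Two commits by two hypothetical authors, with fixed dates.
echo a > a.txt && git add a.txt
GIT_AUTHOR_DATE="2023-01-10T12:00:00" GIT_COMMITTER_DATE="2023-01-10T12:00:00" \
  git -c user.name="Alice" -c user.email="alice@example.com" commit -q -m "Add a"
echo b > b.txt && git add b.txt
GIT_AUTHOR_DATE="2024-03-05T12:00:00" GIT_COMMITTER_DATE="2024-03-05T12:00:00" \
  git -c user.name="Bob" -c user.email="bob@example.com" commit -q -m "Add b"

# --author matches a pattern against the author name and email.
by_author=$(git log --author="Alice" --format=%s)
# --since keeps commits newer than the given date.
recent=$(git log --since="2024-01-01" --format=%s)
echo "by Alice: $by_author"
echo "since 2024: $recent"
```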
The working tree is what you see when you list the files in your project’s directory that are being tracked. It’s the version of your project that you’re currently working on. The git checkout command is used to update the working directory with a specific commit, matching the snapshot recorded in the commit. Untracked files are not affected by git checkout.
The git ls-files
command lists the files in the working tree that are being tracked by git. The filenames can be searched for a string using the grep
command.
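A small sketch of that pipeline on a scratch repo. The file names are hypothetical, and one matching file is deliberately left untracked to show that git ls-files skips it:

```shell
demo=$(mktemp -d) && cd "$demo" && git init -q
mkdir -p test/utils
touch test/utils/test_sparse.py test/utils/test_dense.py scratch_test_sparse.txt
git add test   # scratch_test_sparse.txt stays untracked

# git ls-files prints tracked paths; grep keeps those containing the string.
tracked=$(git ls-files | grep "test_sparse")
echo "$tracked"
```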
If we want to log the commits (messages and commit ids) that modified files whose names contain the string test_sparse, we can use the following command, truncating the output with a pipe to head:
commit 62fa51e0000913e1b3023b817485d2b248322539
Author: Matthias Fey <matthias.fey@tu-dortmund.de>
Date: Sun Dec 24 11:56:08 2023 +0100
Accelerate concatenation of `torch.sparse` tensors (#8670)
Fixes #8664
commit 1c89e751804d1eb2fb626dabc677198a1878c34d
Author: Matthias Fey <matthias.fey@tu-dortmund.de>
Date: Wed Oct 4 09:59:36 2023 +0200
Skip TorchScript bug for PyTorch < 1.12 (#8123)
commit 51c50c2f9d3372de34f4ac3617f396384a36558c
Author: filipekstrm <filip.ekstrom@hotmail.com>
Date: Tue Oct 3 20:39:04 2023 +0200
Added `mask` argument to `dense_to_sparse` (#8117)
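The command behind this output is not shown. One way to restrict a log to matching filenames is to pass a glob pathspec after --; this is an assumption about the approach, not necessarily the original command. On a scratch repo, with a hypothetical commit helper:

```shell
demo=$(mktemp -d) && cd "$demo" && git init -q
# Hypothetical helper: commit with a throwaway identity.
commit() { git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "$1"; }

mkdir -p test/utils
echo "x = 1" > test/utils/test_sparse.py && git add . && commit "Add sparse tests"
echo "notes" > README.md && git add . && commit "Add README"
echo "x = 2" > test/utils/test_sparse.py && git add . && commit "Update sparse tests"

# Everything after -- is a pathspec; only commits touching matching paths are
# listed. head truncates long histories.
subjects=$(git log --format=%s -- "*test_sparse*" | head -n 5)
echo "$subjects"
```

Dropping --format=%s gives the full commit, author and date blocks like those in the output above.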
To search for a string inside file contents across commits, we can use the git log and git grep commands. The git log command lists the commits in reverse chronological order.
The -S flag finds commits that change the number of occurrences of the string, and --all extends the search to all branches rather than only the current one. (Again we’ll pipe to head to truncate the output.)
commit dba9659f6c4f29fd2be1f50b5ea12a29a926082f
Author: Matthias Fey <matthias.fey@tu-dortmund.de>
Date: Thu Feb 29 14:04:19 2024 +0100
Fix `EdgeIndex.resize_` linting issues (#8993)
commit 123e38ef6715f75ed9198d256cc2cb984b431630
Author: Poovaiah Palangappa <98763718+pmpalang@users.noreply.github.com>
Date: Sun Feb 11 03:32:44 2024 -0800
Example of a recommender system (#8546)
Hi Everyone,
I'm adding a recommender system example with the following salient
features
1. Dataset MovieLens – a heterogenous use case
2. Demonstrates the use of edge based temporal sampling
3. Visualization
To be specific to a branch, replace --all with the branch name (master in this case).
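A minimal sketch of the -S search on a scratch repo (hypothetical file, commit messages and helper). Only the commit that introduces the string should be reported, since the later commit leaves the number of occurrences unchanged:

```shell
demo=$(mktemp -d) && cd "$demo" && git init -q
# Hypothetical helper: commit with a throwaway identity.
commit() { git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "$1"; }

echo "def test_dense(): pass" > t.py && git add . && commit "Add dense test"
echo "def test_sparse(): pass" >> t.py && git add . && commit "Add sparse test"
echo "# unrelated comment" >> t.py && git add . && commit "Add comment"

# -S lists commits whose diff changes how many times the string occurs;
# --all walks every ref rather than only the current branch.
pickaxe=$(git log -S"test_sparse" --all --format=%s)
echo "$pickaxe"
```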
If we just want the commit hashes and filenames where a file was added (and has the string in its contents), we can use the --name-only flag with a pretty format:
801723efa
test/utils/test_cross_entropy.py
1dadc0705
torch_geometric/testing/asserts.py
2c01aa22c
test/utils/test_sparse.py
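One combination of flags that yields hash-plus-filename output of this shape is sketched below. It assumes --diff-filter=A is what restricts the result to added files; the original command is not shown, so treat the exact flags as a guess:

```shell
demo=$(mktemp -d) && cd "$demo" && git init -q
# Hypothetical helper: commit with a throwaway identity.
commit() { git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "$1"; }

echo "def test_sparse(): pass" > a.py && git add . && commit "Add a.py"
echo "def test_dense(): pass" > b.py && git add . && commit "Add b.py"
echo "def test_sparse_too(): pass" >> b.py && git add . && commit "Extend b.py"

# -S picks files whose occurrence count changed, --diff-filter=A keeps only
# files that were added in that commit, --name-only prints paths instead of
# patches, and the pretty format reduces each commit header to a short hash.
added=$(git log -S"test_sparse" --diff-filter=A --name-only --pretty=format:%h)
echo "$added"
```

Here b.py gains the string through a modification rather than an addition, so only the commit that added a.py should be listed.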
For a regular expression search, use the -G flag (no surrounding * wildcard is needed, since a regular expression can match anywhere in a line).
Nothing. It seems that the string convert_coo_to_csr_indices is not in the contents of any files in the repo.
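A sketch of how a literal search can come up empty while a regular expression still finds a related name. The file content reuses the torch._convert_indices_from_coo_to_csr call that appears in the repo output; the scratch repo, the commit helper and the regex are assumptions for illustration:

```shell
demo=$(mktemp -d) && cd "$demo" && git init -q
# Hypothetical helper: commit with a throwaway identity.
commit() { git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "$1"; }

echo "rowptr = torch._convert_indices_from_coo_to_csr(row, n)" > edge_index.py
git add . && commit "Add CSR conversion"

# -S looks for the exact string, which never occurs in this history.
literal=$(git log -S"convert_coo_to_csr_indices" --format=%s)
# -G matches a regular expression against the added and removed lines.
regex=$(git log -G"convert.*coo.*csr" --format=%s)
echo "literal: [$literal]"
echo "regex:   [$regex]"
```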
390942fc4
torch_geometric/data/edge_index.py
699120e25
torch_geometric/data/edge_index.py
a6f0f4947
torch_geometric/data/edge_index.py
cf786b735
torch_geometric/data/edge_index.py
b825dc637
torch_geometric/data/edge_index.py
b5ecfd9b4
torch_geometric/data/graph_store.py
torch_geometric/nn/conv/cugraph/base.py
torch_geometric/nn/conv/rgcn_conv.py
torch_geometric/nn/dense/linear.py
Let’s try a few different strings.
dba9659f6
test/test_edge_index.py
123e38ef6
test/test_edge_index.py
23bbc128d
test/test_edge_index.py
ed9698d0b
torch_geometric/testing/asserts.py
1725f1436
test/utils/test_cross_entropy.py
801723efa
test/utils/test_cross_entropy.py
1dadc0705
torch_geometric/testing/asserts.py
7b4892781
test/nn/conv/test_gcn_conv.py
72e8ef33d
test/nn/conv/test_gcn_conv.py
93fab2e53
test/nn/conv/test_gcn_conv.py
d01ea9dab
test/utils/test_sparse.py
2c01aa22c
test/utils/test_sparse.py
eb4260ce0
torch_geometric/nn/functional/pool/voxel_pool_test.py
544f4ad0e
torch_geometric/nn/functional/pool/voxel_pool_test.py
commit f0e4c829662df9eb67fd5c0abda002c9b7cd0afb
Author: Ravi Kalia <ravkalia@gmail.com>
Date: Sun Mar 24 08:05:12 2024 -0500
Replace `withCUDA` decorator: `withDevice` (#9082)
Replace `withCUDA` for a `withDevice` decorator.
Change variable name from devices to processors to reduce confusion
against pytorch api (backends/devices) and reflect the hardware choices.
Note that at this time:
## Hardware
3 repertoires of hardware can be used to run pyTorch code:
* CPU only
* CPU and GPU
* Unified Memory Single Chip
commit 25b2f208e671eeec285bfafa2e246ea0a234b312
Author: Ravi Kalia <ravkalia@gmail.com>
Date: Wed Feb 21 11:11:33 2024 -0500
docs: fix broken links to source of graph classification datasets (#8946)
**Update Broken Dataset Links in Documentation**
This PR addresses broken links in the documentation that pointed to the
common benchmark datasets. The links were updated to point to the
correct URL.
Changes were made in the following files:
1. `benchmark/kernel/README.md`
2. `docs/source/get_started/introduction.rst`
The specific changes are as follows:
In `benchmark/kernel/README.md`:
commit 24a185e7268f70ee549c7a424b9426b9a18b5706
Author: Ramona Bendias <ramona.bendias@gmail.com>
Date: Mon Feb 21 13:03:52 2022 +0000
Add general `Explainer` Class (#4090)
* Add base Explainer
* Update Explainer
* Fix test
* Clean code
* Update test/nn/models/test_explainer.py
Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de>
* Update torch_geometric/nn/models/explainer.py
24a185e72 Add general `Explainer` Class (#4090)
6002170a5 Make models compatible to Captum (#3990)
14d588d4c Update attention.py (#4009)
50ff5e6d6 Add `full` extras to install command in contribution docs (#3991)
1e24b3a16 Refactor: `MLP` initialization (#3957)
3e4891be6 Doc improvements to set2set layers (#3889)
fac848c25 Let `TemporalData` inherit from `BaseData` and add docs (#3867)
0c29b0d5b Updated docstring for shape info - part 2 (#3739)
The main differences between git grep and grep are:
git grep only searches through your tracked files, while grep can search through any files.
git grep is aware of your Git repository structure and can search through old commits, branches, etc., while grep only searches through the current state of files.
git grep is faster than grep when searching through a Git repository because it takes advantage of Git’s index data structure.
CPU times: user 1.52 ms, sys: 4.44 ms, total: 5.96 ms
Wall time: 22.6 ms
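The tracked-files difference can be seen on a scratch repo with one tracked and one untracked file, both containing the string (file names and contents are made up):

```shell
demo=$(mktemp -d) && cd "$demo" && git init -q
echo "def test_sparse(): pass" > tracked.py
echo "test_sparse notes" > untracked.txt
git add tracked.py
git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "init"

# Plain grep walks the directory tree (skipping .git here) and sees both files.
plain=$(grep -rl --exclude-dir=.git "test_sparse" . | sort)
# git grep only looks at files that git tracks.
tracked_only=$(git grep -l "test_sparse")
echo "grep:     $plain"
echo "git grep: $tracked_only"
```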
There are many ways to search a repository, particularly a Git Repo. We outlined some use cases with examples for a “Unix-like” file directory and also a Git Repo.
In most cases use:
git grep for searching strings in the repository in the current working tree or a specific commit
git log for searching across commits.
There are many flags and options for these commands, and some combinations of them produce the same output. Be sure to check the documentation for more information.
For the strings we are after, the conclusion is:
test_sparse is in the filename test_sparse.py and in the contents of the file test_sparse.py in the repo.
convert_coo_to_csr_indices is not in the contents of any files in the repo.
Close variants of convert_coo_to_csr_indices, such as torch._convert_indices_from_coo_to_csr, are available.
The most promising outputs from the commands tested are:
./test/utils/test_cross_entropy.py:9:def test_sparse_cross_entropy_multiclass(with_edge_label_weight):
./test/utils/test_cross_entropy.py:32:def test_sparse_cross_entropy_multilabel(with_edge_label_weight):
./test/test_edge_index.py:102:def test_sparse_tensor(dtype, device):
./test/test_edge_index.py:992:def test_sparse_narrow(device):
./test/test_edge_index.py:1026:def test_sparse_resize(device):
./torch_geometric/testing/asserts.py:24: test_sparse_layouts: Optional[List[Union[str, torch.layout]]] = None,
./torch_geometric/testing/asserts.py:49: test_sparse_layouts (List[str or int], optional): The sparse layouts to
./torch_geometric/testing/asserts.py:62: if test_sparse_layouts is None:
./torch_geometric/testing/asserts.py:63: test_sparse_layouts = SPARSE_LAYOUTS
./torch_geometric/testing/asserts.py:74: if len(test_sparse_layouts) > 0 and sparse_size is None:
./torch_geometric/testing/asserts.py:75: raise ValueError(f"Got sparse layouts {test_sparse_layouts}, but no "
./torch_geometric/testing/asserts.py:93: for layout in (test_sparse_layouts or []):
Binary file ./.git/index matches
390942fc4
torch_geometric/data/edge_index.py
699120e25
torch_geometric/data/edge_index.py
a6f0f4947
torch_geometric/data/edge_index.py
cf786b735
torch_geometric/data/edge_index.py
b825dc637
torch_geometric/data/edge_index.py
b5ecfd9b4
torch_geometric/data/graph_store.py
torch_geometric/nn/conv/cugraph/base.py
torch_geometric/nn/conv/rgcn_conv.py
torch_geometric/nn/dense/linear.py
Commit: 390942fc4
390942fc4:torch_geometric/data/edge_index.py:344: self._indptr = torch._convert_indices_from_coo_to_csr(
390942fc4:torch_geometric/data/edge_index.py:382: rowptr = self._T_indptr = torch._convert_indices_from_coo_to_csr(
390942fc4:torch_geometric/data/edge_index.py:403: colptr = self._T_indptr = torch._convert_indices_from_coo_to_csr(
390942fc4:torch_geometric/utils/sparse.py:480: return torch._convert_indices_from_coo_to_csr(
Commit: 699120e25
699120e25:torch_geometric/data/edge_index.py:323: self._rowptr = rowptr = torch._convert_indices_from_coo_to_csr(
699120e25:torch_geometric/data/edge_index.py:351: self._rowptr = rowptr = torch._convert_indices_from_coo_to_csr(
699120e25:torch_geometric/data/edge_index.py:375: self._colptr = colptr = torch._convert_indices_from_coo_to_csr(
699120e25:torch_geometric/data/edge_index.py:403: self._colptr = colptr = torch._convert_indices_from_coo_to_csr(
699120e25:torch_geometric/utils/sparse.py:480: return torch._convert_indices_from_coo_to_csr(
Commit: a6f0f4947
a6f0f4947:torch_geometric/data/edge_index.py:321: self._rowptr = torch._convert_indices_from_coo_to_csr(
a6f0f4947:torch_geometric/data/edge_index.py:352: self._rowptr = torch._convert_indices_from_coo_to_csr(
a6f0f4947:torch_geometric/data/edge_index.py:379: self._colptr = torch._convert_indices_from_coo_to_csr(
a6f0f4947:torch_geometric/data/edge_index.py:410: self._colptr = torch._convert_indices_from_coo_to_csr(
a6f0f4947:torch_geometric/utils/sparse.py:480: return torch._convert_indices_from_coo_to_csr(
Commit: cf786b735
cf786b735:torch_geometric/data/edge_index.py:236: self._rowptr = torch._convert_indices_from_coo_to_csr(
cf786b735:torch_geometric/data/edge_index.py:255: self._colptr = torch._convert_indices_from_coo_to_csr(
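The per-commit listing above has the shape of a loop that feeds each hash from git log into git grep. The sketch below reproduces that shape on a scratch repo; the loop, the commit helper and the file contents are assumptions, not the original command:

```shell
demo=$(mktemp -d) && cd "$demo" && git init -q
# Hypothetical helper: commit with a throwaway identity.
commit() { git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "$1"; }

echo "a = torch._convert_indices_from_coo_to_csr(row, n)" > edge_index.py
git add . && commit "One call"
echo "b = torch._convert_indices_from_coo_to_csr(col, n)" >> edge_index.py
git add . && commit "Two calls"

# For every commit that changed the occurrence count, search that snapshot.
# git grep with a revision prefixes each match with <rev>:<path>:<line>:.
report=$(
  for c in $(git log -S"_convert_indices_from_coo_to_csr" --format=%h); do
    echo "Commit: $c"
    git grep -n "_convert_indices_from_coo_to_csr" "$c"
  done
)
echo "$report"
```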
This is a good starting point for debugging the issues with the unit tests in the library. Useful and informative :=)
And finally some clean up: