I'm new to development and trying to upload small projects that I've worked on to my GitHub profile. These projects are not dependent on each other.
My issue is that some of them are small single-file projects. Sort of like mini challenges that I've solved. So I'm thinking of grouping them together under one repo called "Python programming", for example.
Is this a good practice?
If yes, how should I go about it in Git, and how can I still have a README file showing up for each mini project?
If no, what would you recommend doing?
GitHub will render a README file for every folder you visit, so if you use just a single repository, one solution would be to create one subfolder for each “subproject”, each of which can then have its own README file.
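For example, a layout along these lines (the folder and file names are just placeholders) gives every mini project its own rendered README:

    python-programming/
        README.md          # overview of the whole collection
        fizzbuzz/
            README.md
            fizzbuzz.py
        prime-sieve/
            README.md
            sieve.py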
But before going that route, you should think about if those small projects actually belong together. That’s ultimately what should decide whether you want to put them all in the same repository or whether you want to split it up into several repositories.
Some things to consider for that decision:
If the projects do not depend on one another, do they still relate to one another? For example, are those projects part of a bigger programming challenge like Project Euler, and you’re just collecting all your solutions? Then a single repository might make more sense.
What is the chance of individual projects growing into bigger things? Many things start very small but can eventually grow into real projects that justify their own repository. At that point, you might even get others to contribute.
Does it make sense for those individual files to share a history? Are the files even going to be edited once they are “done”? I.e. is this just a collection of finished things, or are they actually ongoing experiments?
Ultimately, it comes down to your personal choice. But GitHub, as the repository host, should not be driving your decision. You should create Git repositories locally as it makes sense to you. If that means you just have a single one, that’s fine. If that means you create lots of them, that’s also fine.
Unfortunately, the GitHub UI is not really made for small one-off projects. The repository list is just too unorganized for that. If you decide to create many small repositories, I advise you to add a prefix to their names for categorization, so you know what each one is about.
A good alternative for one-off projects, especially when it’s just a single file (or a few files), is a Gist. Gists started out as a way to share code snippets, but under the hood every Gist is actually a full Git repository. Of course, Gists do not offer the tools normal repositories on GitHub have (e.g. issues, pull requests, wikis). But for what you describe, you probably don’t need any of those. Gists are then a fine way to share simple things without adding full repositories to your profile. And you can still clone them (the remote URL is git@gist.github.com:<gist-id>.git) and get a full history and support for multiple files if you need them.
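For instance, cloning a Gist works like cloning any other repository; <gist-id> is the hash from the Gist’s URL:

    # over SSH
    git clone git@gist.github.com:<gist-id>.git
    # or over HTTPS
    git clone https://gist.github.com/<gist-id>.git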
Commonly, you'll see that the top level of the repo contains the README file, maybe a setup.py and some other ancillary files, and perhaps a tests folder. Then there will be a folder that shares a name with the repo. Inside that folder is the code that's intended to be the core content of the module/package/script.
It's also not unusual to see different organization, particularly with very small projects consisting of single-file scripts.
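As a rough sketch, that conventional layout looks something like this (myproject and core.py are placeholder names):

    myproject/             # repository root
        README.md
        setup.py
        tests/
        myproject/         # the package itself, same name as the repo
            __init__.py
            core.py        # hypothetical module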
For the specific case you mention, do whatever you like. What you propose sounds totally reasonable to me. I would not want to have a separate repo for all the challenges I solve!
I usually use a gist for trivial items I don't necessarily want to make a repo for, including coding challenges. So I would offer that as an alternative. Do whatever suits you best.
I am currently working on a tool for Python developers who use the Spyder IDE. We want a consistent code format. They are not willing to switch to IDEs that have plugins to do this automatically.
I have been testing the YAPF library, and am looking for a way to have the code automatically formatted this way any time a commit or push happens on GitLab.
Do I need some workflow? Is this considered similar to CI pipelines? I am unsure how to tackle this.
Any feedback is helpful, and greatly appreciated.
I have been testing the YAPF library, and am looking for a way to have the code automatically formatted this way any time a commit or push happens on GitLab.
You don't want to do that: you don't want your CI environment to modify commits, because either (a) you'll end up duplicating every commit with a second "fix the formatting" commit, or (b) you'll make the history of your GitLab repository diverge from the history of the person who submitted the change (if you replace the commit).
The typical solution here is to update your CI to reject pull-/merge-requests that don't meet your formatting standards. At the same time, provide developers with a tool they can run locally that will apply the correct formatting to files before they submit a change request.
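As a sketch of that setup (assuming yapf is installed in the CI job's environment, and that your YAPF version exits non-zero from --diff when reformatting is needed, as recent versions do), the CI job runs the check and developers run the fix locally:

    # CI-side check: fail the job if any Python file is not YAPF-formatted
    yapf --recursive --diff .

    # Developer-side fix, run locally before committing
    yapf --recursive --in-place .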
If your developers are committing directly to the repository, rather than operating through merge requests, your options are far more limited (at that point, you have a social problem, not a technical problem).
I have a problem: we are using a package that has not been maintained for a while now, so we forked it in order to maintain it ourselves. The package already exists; let's say it is named package_a. Most of the code and the __init__ are in the package_a/ folder.
Now we want to make our own package that will include our maintained code, and we want to name it package_b. So far so good, but the problem is that package_b wants to have the code and the __init__ in a package_b/ folder, and GitHub changes the contributions for all files when a folder is renamed. I would like the credit for contributions to stay where it is due; the 10k+ lines of code didn't just appear in my local repo out of thin air. Any suggestions on how we can have the package named package_b but keep the code in the original package_a/ folder?
I am thinking along the lines of some clever way of importing package_a into package_b, or something along those lines, but I hope for a definitive answer.
Instead of copying the code or trying to import A into B, extract the common code into a 3rd package which both A and B import. Or perhaps a subclass. This doesn't solve your contribution problem, but it does avoid making a big maintenance hassle by copying and pasting 10,000 lines of code.
Git doesn't record copies and renames, but it can recognize when they happen. To give Git the best chance of recognizing a copy, do only the copy in its own commit. Make no changes to the content. Then in a second commit make any necessary changes to the copied code.
In normal Git you can nudge git log and git blame to honor copies and renames with -C. Git doesn't do this by default because it's more expensive.
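For illustration (the file names are placeholders), the copy-only commit and the copy-aware blame could look like this:

    # Commit 1: copy only, no content changes, so Git can recognize the copy
    cp -r package_a package_b
    git add package_b
    git commit -m "Copy package_a to package_b, no changes"

    # Later: detect lines copied or moved from other files
    git blame -C -C package_b/some_module.py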
GitHub will do what GitHub will do.
Regardless of who GitHub says wrote which line, their contributions will still be in the project history. That's how it goes: you make your contribution, and then others put their own work on top of it. This is normal. Their contributions remain in the history.
"History shear" is also normal; that's when a change touches many lines but is otherwise insignificant. For example, if you were to restyle the code, that would cause a history shear. git blame will say that was the last commit to touch the code. git blame -w mitigates this somewhat, and GitHub has an "ignore whitespace" option. History shear is normal, and so is learning to skip over it.
The tools work for us. Don't bend yourself for the benefit of the tools.
If you want to give a special shout-out to your contributors, add a contributors section to your README.md.
I'm building a webapp using Django which needs to have two different versions: an Enterprise version and a standard public version. Up until now, I've only been developing the Enterprise version, and I'm now looking for the simplest way to separate the two versions while avoiding code duplication as much as possible. The main difference between the two versions is that they need different URLs and different views. I intend to differentiate based on subdomain using a multi-tenant architecture, where www.example.com is the public version and company1.example.com hits the enterprise version.
I've come up with a couple potential solutions, but I'm not happy with any of them.
Separate Git repositories and entirely separate projects, with all common code duplicated. This much duplication of code is bound to be error-prone: things will get out of sync, and the code is sure to be riddled with copy-paste mistakes. This is a last-resort solution.
Separate Git repositories, with common code shared via Git Submodules (a single common 'base' repository containing base models and shared views). I've read horror stories about git submodules, though, so I'm wary of this solution.
Single Git repository containing multiple 'project' folders (public/enterprise), each with its own base urls.py, settings.py, wsgi.py, etc., and multiple manage.py files to choose which "project" to run. I'm afraid this solution would become an utter mess, because it wouldn't be possible to have the public and enterprise versions use different versions of the common library if one needs an update before the other.
Separate Git repositories, with all shared code developed as 'Re-usable apps' and installed into the python path. This would be a somewhat clean solution, but would be difficult to work with any time changes needed to be made to the common modules.
Single project where all features are managed via conditional logic in the views. This would be most prone to bugs and confusion of all, and I'd prefer to avoid this solution.
Does anyone have any experience with this type of solution or could anyone help me find the best solution to this problem?
What about "a single Git repository, with all shared code developed as 'Re-usable apps'"? That is configure the options enabled with the INSTALLED_APPS setting.
First you need to decide on your release process. If you intend to release both versions simultaneously, using a single Git repository makes sense.
An overriding concern might be if you have different distribution requirements for the code, e.g. if you want the code in the public version to be publicly available and the enterprise version to be private. Then you might have to use two git repositories.
Have you looked into using git subtree? It's an alternative to submodules, and it makes the process a little less complicated. I think Atlassian does a great job of explaining how it's used and the pros and cons. A few examples are:
"Contents of the module can be modified without having a separate repository copy of the dependency somewhere else."
"The sub-project’s code is available right after the clone of the super project is done."
"Management of a simple workflow is easy."
The Atlassian link is here.
Here's also a link to git-subtree's description file.
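For reference, a minimal subtree workflow might look like the following (the repository URL, prefix, and branch are placeholders):

    # Add the shared library as a subtree under libs/common
    git subtree add --prefix=libs/common https://example.com/common-lib.git main --squash

    # Later, pull in upstream changes to the shared library
    git subtree pull --prefix=libs/common https://example.com/common-lib.git main --squash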
Probably the best solution is to identify exactly which code is shared between the two projects and make that a reusable app.
Then each installation can install that Django app and have its own site-specific code as well.
My company has a single Git repository which is over 15 years old and really massive; about 60% of it could be archived. I want to identify the frequently used scripts (Python, Perl, Ruby, Java, etc.) and create a new Git repository containing only those. The scripts also have cross-dependencies.
One solution I thought of was to set up inotify to watch the files in the Git repo and collect the names of recently accessed scripts, gather data over a few months, and then create the new repo based on that data. Not sure how efficient it would be, though.
Another solution I thought of was to use the Git commit date for each file and remove files that are over 5 years old.
Could anyone let me know of an efficient solution to clean up this mess? Or any tool similar to NewRelic that would monitor the filesystem?
First, it's not clear what problem you are trying to solve. Is the 15-year Git history slowing things down when cloning? If so, maybe just do a shallow git clone instead? (A shallow clone doesn't download the history.)
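For example (the URL is a placeholder), a shallow clone fetches only the latest snapshot:

    git clone --depth 1 https://example.com/huge-repo.git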
As Thilo pointed out, cutting the repo in half isn't going to make things that much faster.
But if the scripts are really that disorganized, it's highly likely that some of them need to be rewritten, documented, etc. If you just move the scripts forward, it's likely you are moving lots of inefficiencies forward too. I'd pick them off one at a time, and give them a little love, test them, etc.
One idea: You can use strace -ff -o strace.out ./myscript to figure out what other files a script opens.
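A possible refinement, purely as a sketch: restrict the trace to file-opening syscalls and then collate the per-process output files (strace writes one file per child process as strace.out.<pid>):

    strace -ff -e trace=open,openat -o strace.out ./myscript
    grep -h 'open' strace.out.* | sort -u    # rough list of opened paths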
What is the best way to organize and develop a project composed of many small scripts sharing one (or more) larger Python libraries?
We have a bunch of programs in our repository that all use the same libraries stored in the same repository. So in other words, a layout like
trunk
    libs
        python
        utilities
    projects
        projA
        projB
When the official runs of our programs are done, we want to record what version of the code was used. For our C++ executables, things are simple because as long as the working copy is clean at compile time, everything is fine. (And since we get the version number programmatically, it must be a working copy, not an export.) For Python scripts, things are more complicated.
The problem is that often one project (e.g. projA) will be running, and projB will need to be updated. This could cause the working-copy revision to appear mixed to projA during runtime. (The code takes hours to run, and its output can be used as input for processes that take days to run, hence the strong traceability goal.)
My current workaround is, if necessary, to check out another copy of the trunk to a different location and run from there. But then I need to remember to change my PYTHONPATH to point to the second version of lib/python, not the one in the first tree.
There's not likely to be a perfect answer. But there must be a better way.
Should we be using subversion keywords to store the revision number, which would allow the data user to export files? Should we be using virtualenv? Should we be going more towards a packaging and installation mechanism? Setuptools is the standard, but I've read mixed things about it, and it seems designed for non-developer end users (of which we have none).
A much better solution involves not storing all your projects and their shared dependencies in the same repository.
Use one repository for each project, and externals for the shared libraries.
Make use of tags in the shared library repositories, so consumer projects may use exactly the version they need in their external.
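As an illustration (the repository URL and paths are hypothetical), an external pinned to a tag could be set on a consumer project like this:

    # From the root of projA's working copy:
    svn propset svn:externals 'https://svn.example.com/shared-lib/tags/1.2.0 libs/python' .
    svn commit -m "Use shared-lib 1.2.0 via svn:externals"
    svn update    # fetches the external into libs/python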
Edit: (just copying this from my comment) use virtualenv if you need to provide isolated runtime environments for the different apps on the same server. Then each environment can contain a unique version of the library it needs.
If I'm understanding your question properly, then you definitely want virtualenv. Add in some virtualenvwrapper goodness to make it that much better.
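A minimal sketch of that approach (the paths are placeholders, and it assumes the shared library has a setup.py so it can be pip-installed):

    # One isolated environment per project, or per official run
    virtualenv projA-env
    source projA-env/bin/activate
    pip install /path/to/second/checkout/libs/python
    python projects/projA/run.py    # run.py is a hypothetical entry point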