I have been working on a major private Python project in an exploratory way for several months, using PyCharm.
I use git to track the changes in that project.
I am the only person that contributes to that project.
Normally I commit changes to that project roughly once a day.
Now I want to track the changes to my code every time I execute it (the background is that I sometimes lose track of which intermediate result was achieved with which version of my code).
Therefore I want to perform a git commit of all changed files at the end of the script execution.
Since each of these commits would only get a 'technical' commit message, I would like to distinguish these 'technical' commits from the other commits I do roughly once a day (see above). The background is that I would still like to see and compare the daily differences with each other. The technical commits might add up to several dozen per day and would make it hard to see the major changes over the course of time.
Which techniques does git offer to distinguish the technical commits from the daily commits I do?
Is branching perhaps a valid approach for this? If so, would I delete these branches later on? (I am a Git novice.)
You could use a branch for that, yes. Just use a working branch when doing your scripted autocommits, and then when you want to make a commit for the history, switch to your main branch.
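For the scripted part, here is a minimal sketch of what the end of the script could do, assuming a recent Python 3, that the script runs somewhere inside the working copy, and that the working branch is already checked out (the 'AUTO:' message prefix is just an example tag):

import subprocess
import datetime

def autocommit(repo_dir="."):
    # stage everything that changed in the working copy
    subprocess.run(["git", "add", "-A"], cwd=repo_dir, check=True)
    # commit with a clearly marked "technical" message; no check=True here,
    # because git commit exits non-zero when there is nothing to commit
    msg = "AUTO: run at " + datetime.datetime.now().isoformat(timespec="seconds")
    subprocess.run(["git", "commit", "-m", msg], cwd=repo_dir)

autocommit()  # call this as the last line of the script

Tagging the message like this also makes it easy to spot (or filter out) the technical commits later.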
To re-add the final changes as a single commit, one way would be to soft-reset the history when you are done with the changes. So you would run:
git reset prev-real-commit
This jumps the history back to before your new batch of WIP autocommits, but does not touch the files, so you don't lose work. Then you can make a new commit for the changes as usual.
That technique also works without a branch. Using a branch might still be nice, though, so you can easily check what the version was before your new WIP commits.
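Put together, one round of that could look roughly like this (prev-real-commit stands for however you point at the last 'real' commit: a hash, a tag, or something like HEAD~12):

git reset prev-real-commit   # default (mixed) reset: rewinds the branch, leaves your files as they are
git add -A
git commit -m "Real commit for the day"

The rewound autocommits become unreachable and are eventually cleaned up by git gc once their reflog entries expire.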
Git also has rebasing, which would allow squashing multiple commits into one and rewriting the messages. But for the workflow you describe, I think simply resetting the autocommits away and redoing a normal commit is better.
Also the suggestion to add some tag to the message of the autocommits is good.
That said, I usually just commit the checkpoints that I need in my normal dev flow. It can be nicer to have a commit, say, every hour instead of only once a day. Small atomic commits are good. You can use feature branches, and pull requests on GitHub, if you want to record and manage larger units of work.
I think that even if you work on this project alone, it might still be a good idea to adopt the typical GitHub Flow approach and start using branches.
The idea is that you distinguish your "technical" commits (many issued throughout the day) from your daily commits (rarely more than one) in terms of Git entities used:
your main code stays in the master branch
your daily commits remain 'normal' commits going into a specific long-running branch (develop is a common name)
your 'once-a-day' commit becomes a merge commit, bringing all the changes from develop into the master branch
This allows you to keep the history, yet see a clear distinction between those two types. You can opt for the 'no fast forward' approach (git merge --no-ff), so that each merge commit stays clearly distinct from the 'regular' ones.
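In commands, the daily step could look roughly like this (using the conventional branch names from above):

git checkout master
git merge --no-ff develop    # --no-ff forces a real merge commit even when a fast-forward would be possible
git checkout develop         # carry on with the technical commits

Afterwards, git log --merges (or git log --first-parent master) shows only the daily merge commits, which gives you exactly the coarse-grained history you are after.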
And if you actually don't want all that history to be there (as @antont said, there might be a LOT of commits), you might consider 'squashing' those commits when either merging or rebasing, as described here.
Say, for example, you are working at Google on the YouTube team and you want to modify how the search bar looks, or just want to change the font size, or work on a major project like the recommender system, etc. Does making a Git branch copy over ALL of the backend code for YouTube onto your machine? So if there are 100 engineers on the YouTube team working from their laptops, are there 100 copies of the YouTube code in circulation on their tiny laptops? Because as I understand Git, when you branch off, you create a copy of the source code, which you merge back into the production branch, which merges into the master branch.
Please correct me if I am wrong as I have only worked on MUCH smaller projects which use Git (~100 files, ~15k lines of code).
Your support will be much appreciated.
Thanks.
Creating a branch in Git copies nothing.
OK, this is a bit of an overstatement. It copies one hash ID. That is, suppose you have an existing repository with N branches. When you create a new branch, Git writes one new file holding a short (currently 40-byte long, eventually to be 64-byte long) hash ID. So if your previous disk usage was 50 megabytes, your new disk usage is ... 50 megabytes.
On the other hand, cloning a repository copies everything. If the repository over on Server S is 50 megabytes, and you clone it to Laptop L, the repository on Laptop L is also 50 megabytes.[1] There are ways to reduce the size of the clone (by omitting some commits), but they should be used with care. In any case, these days 50 megabytes is pretty small anyway. :-)
There's a plan in the works for Git to perform a sort of mostly-delayed cloning, where an initial clone copies some of the commits and replaces all the rest with a sort of IOU. This is not ready for production yet, though.
The way to understand all of this is that Git does not care about files, nor about branches. Git cares about commits. Commits contain files, so you get files when you get commits, and commits are identified by incomprehensible hash IDs, so we have branch names with which to find the hash IDs. But it's the commits that matter. Creating a new branch name just stores one existing commit hash ID into the new branch name. The cost of this is tiny.
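You can see this for yourself on any repository: as long as the ref hasn't been packed into .git/packed-refs, the new branch is literally a one-line file (feature-x is just a made-up name):

git branch feature-x
cat .git/refs/heads/feature-x    # prints the single 40-character hash the name points to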
[1] This isn't quite guaranteed, due to the way objects stored in Git repositories get "packed". Git will run git gc, the Garbage Collector, now and then to collect and throw out rubbish and shrink the repository size, and depending on how much rubbish there is in any given repository, you might see different sizes.
There have been various bugs in which Git didn't run git gc --auto often enough (in particular, up through 2.17, git commit neglected to start an auto-gc afterward) or in which the auto-gc would never finish cleaning up (due to a left-over failure log from an earlier gc, fixed in 2.12.2 and 2.13.0). In these cases a clone might wind up much smaller than the original repository.
Is there a way to merge the change logs from several different Mercurial repositories? By "merge" here I just mean integrate into a single display; this is nothing to do with merging in the source control sense.
In other words, I want to run hg log on several different repositories at once. The entries should be sorted by date regardless of which repository they're from, but be limited to the last n days (configurable), and should include entries from all branches of all the repositories. It would also be nice to filter by author and do this in a graphical client like TortoiseHg. Does anyone know of an existing tool or script that would do this? Or, failing that, a good way to access the log entries programmatically? (Mercurial is written in Python, which would be ideal, but I can't find any information on a simple API for this.)
Background: We are gradually beginning to transition from SVN to Mercurial. The old repository was not just monolithic in the sense of one server, but also in the sense that there was one huge repository for all projects (albeit with a sensible directory structure). Our new Mercurial repositories are more focused! In general, this works much better, but we miss one useful feature from SVN: being able to use svn log at the root of the repository to see everything we have been working on recently. It's very useful for filling in timesheets, giving yourself a sense of purpose, etc.
I figured out a way of doing this myself. In short, I merge all the revisions into one mega-repo, and I can then look at this in TortoiseHG. Of course, it's a total mess, but it's good enough to get a summary of what happened recently.
I do this in three steps:
1. (Optional) Run hg convert on each source repository using the branchmap feature to rename each branch from original to reponame/original. This makes it easier later to identify which revision came from which source repository. (More faithful to SVN would be to use the filemap feature instead.)
2. In a new repository, run hg pull -f to force-pull from the individual repositories into one big one. This gets all the revisions in one place, but they show up in the wrong order.
3. Use the method described in this answer to create yet another repository that contains all the changes from the one created in step 2, but sorted into the right order. (Actually I use a slight variant: I get the hashes, compare them against the hashes in the destination, check that the destination's list is a prefix of the source's, and only copy the new ones across.)
This is all done from a Python script; although Mercurial is written in Python, I just drive its command-line interface via the subprocess module. Re-running the three steps only copies the new revisions, without rebuilding everything from scratch, unless you add a new repo.
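If all you need is the combined display rather than a merged repository, a simpler route is to shell out to hg log with a template and sort the entries in Python. A rough sketch, assuming Python 3.7+, where the repository paths and the 14-day window are placeholders to adjust:

import subprocess
from datetime import datetime, timedelta

REPOS = ["/path/to/repo1", "/path/to/repo2"]      # placeholder repository paths
CUTOFF = datetime.now() - timedelta(days=14)      # "last n days", configurable

entries = []
for repo in REPOS:
    out = subprocess.run(
        ["hg", "log", "--template", "{date|isodatesec}\t{author|person}\t{desc|firstline}\n"],
        cwd=repo, capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        date_str, author, desc = line.split("\t", 2)
        date = datetime.strptime(date_str[:19], "%Y-%m-%d %H:%M:%S")
        if date >= CUTOFF:
            entries.append((date, repo, author, desc))

# one combined, date-sorted display across all repositories and branches
for date, repo, author, desc in sorted(entries):
    print(date, repo, author, desc, sep="  ")

Filtering by author is then just one more condition inside the loop.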
I have a Python project that I'm working on for research. I've been working on two different machines, and I recently discovered that half of my files used tabs and the other half used spaces.
Python objected to this when I attempted to edit and run a file from one machine on the other, so I'd like to switch everything to spaces instead of tabs. However, this seems like a waste of a Git commit - running 'git diff' on the uncommitted-but-correct files makes it look like I'm wiping out and replacing the entire file.
Is there a way around this? That is, is there some way that I can "hide" these (IMO) frivolous changes?
There is: it's called a rebase. However, you'd have to retroactively add the spaces to every file in each commit where you edited that file, which would be extremely tedious.
However, there is absolutely nothing wrong with having a commit like this. A commit represents a distinct functioning state of your project, and replacing tabs with spaces definitely produces a distinct state.
One place where you would want to use rebase is if you accidentally make a non-functional commit. For example, you might have committed only half the files you need to.
One last thing: never edit history (e.g. with rebase) once you've pushed your changes to another machine. The machines will get out of sync, and your repo will start slowly exploding.
Unfortunately there is no way around the fact that, at the textual level, this is a big change. The best you can do is not mix whitespace changes with any other changes. The topic of such a commit should be nothing but the whitespace change.
If this screw-up is unpublished (only in your private repos), you can go back in time and fix the mess at the point in the history where it was introduced, and then go through the pain of fixing up the subsequent changes (which have to be re-worked in the correct indentation style). For the effort, you end up with a clean history.
It is perfectly valid. Coupling a whitespace reformat with other changes in the same file could obscure the non-whitespace changes. The commit has a single responsibility: to reformat whitespace.
There's a rather large, oldish Python project in which historically most (95%+) of the code base uses tabs for indentation. Mercurial is used as the VCS.
There are several inconveniences in using tabs. It seems that 4 spaces have become the prevailing indentation style within the Python community, and most code-analysis/formatting software messes up tabs one way or another. Also, most (pretty much all, actually) of the team members working on the project prefer spaces to tabs and would thus like to switch.
So, there's this fear of losing the ability to track who last modified a specific line of code... because if all of the lines are converted from tab-based to space-based indentation and the change gets committed to the Mercurial repository, that's exactly what's going to happen. And this feature (hg annotate) is too useful to consider sacrificing.
Is there a way to switch the indentation method across the project without losing the Mercurial hg annotate functionality? If there is, what would be the most painless way?
If you were to replace each tab with 4 spaces, you could still get a reasonably correct result from annotate; just use the switch that ignores changes in the amount of whitespace:
hg annotate -b text.txt
You could also use -w to ignore all whitespace in the comparison, but -b appeared to be the best match: it ignores the case where some whitespace was changed into different whitespace.
This would, however, ignore all lines where only whitespace had been altered, so changes in indentation are skipped and those lines remain attributed to the previous modification of the line.
See hg help annotate for more.
You could create a new repository and, using suitable scripts, populate it with each commit of the previous history BUT with the files automatically modified from what was actually committed to the same content with the tabs replaced. Basically, your script would need to check out the initial file set, get the commit details, replace any tabs in the file set, and then commit to the new repository with the original commit details. It would then move on to the next changeset, generate and apply patches, filter for tabs again, commit, and so on. There is a blog here about doing something similar.
You could do this offline and automatically, and on an agreed-upon date replace the repositories on your server (keeping a copy, of course) with the modified one. Just remember to tell your team that they need to pull before doing any work the next day.
I would strongly recommend implementing pre-commit hooks to ensure that you do not get polluted should anybody try checking in an old-format file. They would probably be worth having in place on the new repository before starting the process.
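A minimal sketch of such a hook, where the hook name and check_tabs.py are hypothetical and the script is assumed to sit in the repository root (Mercurial refuses the commit when a precommit hook exits non-zero). In .hg/hgrc:

[hooks]
precommit.notabs = python check_tabs.py

and check_tabs.py, deliberately rough in that it checks whole modified/added .py files rather than just the lines being committed (Python 3.7+):

# check_tabs.py -- refuse the commit if any modified/added .py file contains tab-indented lines
import subprocess
import sys

files = subprocess.run(["hg", "status", "-m", "-a", "-n"],
                       capture_output=True, text=True, check=True).stdout.splitlines()
bad = []
for path in files:
    if not path.endswith(".py"):
        continue
    with open(path, "rb") as fh:
        if any(line.startswith(b"\t") for line in fh):
            bad.append(path)
if bad:
    print("tab-indented lines found in: " + ", ".join(bad))
    sys.exit(1)   # non-zero exit status makes Mercurial abort the commit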
UPDATE
Having written the above, I finally came up with the right search terms and found hg-clone for you, which should do exactly what you need. To quote the opening comments:
# Usage: hg-clone.rb SOURCE-DIR DEST-DIR --target TARGET-REVISION --cutoff CUTOFF-REVISION --filter FILTER
#
# This program clones a mercurial repository to a new directory, changeset by changeset
# with the option of running a filter before each commit. The filter can be used for
# example to strip out secret data (such as code for unused platforms) from the code.
#
# --target TARGET-REVISION
# The revision that the DEST-DIR should be updated to. This revision, and all its parent revisions
# will be copied over to the dest dir. If no TARGET-REVISION is specified, the latest revision in
# the repository will be used.
#
# --cutoff CUTOFF-REVISION
# If specified, this should be a short revision number that acts as a cutoff for synching. If you
# specify 100 for instance, no revisions before 100 will be brought over to DEST-DIR. Revisions
# that have parents earlier than revision 100 will be reparented to have 100 as their revision.
#
# --filter FILTER
# Specifies a program that should be run in the DEST-DIR before committing each revision.
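For the --filter step, the program could be as small as this: a sketch that rewrites every .py file under the destination directory, expanding leading tabs to 4 spaces (the file selection and the tab width are assumptions to adjust):

# retab.py -- expand leading tabs to 4 spaces in all .py files under the current directory
import os

for root, dirs, files in os.walk("."):
    dirs[:] = [d for d in dirs if d != ".hg"]     # leave the repository metadata alone
    for name in files:
        if not name.endswith(".py"):
            continue
        path = os.path.join(root, name)
        with open(path, "rb") as fh:
            lines = fh.readlines()
        fixed = []
        for line in lines:
            tabs = len(line) - len(line.lstrip(b"\t"))
            fixed.append(b"    " * tabs + line[tabs:])    # 4 spaces per leading tab
        with open(path, "wb") as fh:
            fh.writelines(fixed)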
For those of you who have a Django application in a non-trivial production environment: how do you handle database migrations? I know there is South, but it seems like it would miss quite a lot if anything substantial is involved.
The other two options (that I can think of, or have used) are doing the changes on a test database and then (taking the app offline) importing that SQL export; or, perhaps a riskier option, doing the necessary changes on the production database in real time and, if anything goes wrong, reverting to the backup.
How do you usually handle your database migrations and schema changes?
I think there are two parts to this problem.
The first is managing the database schema and its changes. We do this using South, keeping both the working models and the migration files in our SCM repository. For safety (or paranoia), we take a dump of the database before (and, if we are really scared, after) running any migrations. South has been adequate for all our requirements so far.
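In practice that cycle is just a few commands; the app and database names below are placeholders, and the dump command depends on your engine:

pg_dump mydb > backup_before_migrate.sql          # or mysqldump, etc.
python manage.py schemamigration myapp --auto     # have South generate the migration from the model changes
python manage.py migrate myapp                    # apply it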
The second is deploying the schema change, which goes beyond just running the migration file generated by South. In my experience, a change to the database normally requires a change to the deployed code. If you have even a small web farm, keeping the deployed code in sync with the current version of your database schema may not be trivial; this gets worse if you consider the different caching layers and the effect on already-active site users. Different sites handle this problem differently, and I don't think there is a one-size-fits-all answer.
Solving the second part of this problem is not necessarily straightforward. I don't believe there is a one-size-fits-all approach, and there is not enough information about your website and environment to suggest the solution that would be most suitable for your situation. However, I think there are a few considerations to keep in mind that can help guide deployment in most situations.
Taking the whole site (web servers and database) offline is an option in some cases. It is certainly the most straightforward way to manage updates. But frequent downtime (even when planned) can be a good way to go out of business quickly, makes it tiresome to deploy even small code changes, and might take many hours if you have a large dataset and/or a complex migration. That said, for the sites I help manage (which are all internal and generally only used during working hours on business days) this approach works wonders.
Be careful if you do the changes on a copy of your master database. The main problem here is that your site is still live and presumably accepting writes to the database. What happens to data written to the master database while you are busy migrating the clone for later use? Your site either has to be down the whole time or put into some temporary read-only state, otherwise you'll lose those writes.
If your changes are backwards compatible and you have a web farm, you can sometimes get away with updating the live production database server (which I think is unavoidable in most situations) and then incrementally updating the nodes in the farm by taking them out of the load balancer for a short period. This can work OK; however, the main problem here is that if a node that has already been updated sends a request for a URL which isn't supported by an older node, it will fail, as you can't manage that at the load-balancer level.
I've seen/heard a couple of other ways work well.
The first is wrapping all code changes in a feature lock which is then configurable at run-time through some site-wide configuration option. This essentially means you can release code with all of your changes turned off, and then, after you have made all the necessary updates to your servers, you change your configuration option to enable the feature. But this makes the code quite heavy...
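A very small sketch of the idea in Django terms; the setting, the flag name, and the two flows are made up for illustration:

from django.conf import settings
from django.http import HttpResponse

def billing_view(request):
    # FEATURES would live in settings (or, more realistically, in the database or a cache
    # so it can be flipped without a redeploy), e.g. FEATURES = {"new_billing": False}
    if getattr(settings, "FEATURES", {}).get("new_billing"):
        return HttpResponse("new billing flow")   # stand-in for the code path that needs the new schema
    return HttpResponse("old billing flow")

Once every server runs the new code and the schema is migrated, you flip the flag on; once the old path is dead, you delete it.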
The second is letting the code manage the migration. I've heard of sites where the changes to the code are written in such a way that they handle the migration at runtime. The code is able to detect the version of the schema being used and the format of the data it gets back: if the data is from the old schema, it does the migration in place; if the data is already in the new schema, it does nothing. Through natural site usage a high proportion of your data will be migrated by people using the site; the rest you can do with a migration script whenever you like.
But I think at this point Google becomes your friend because, as I say, the solution is very context-specific and I'm worried this answer will start to get meaningless... Search for something like "zero-downtime deployment" and you'll get results such as this with plenty of ideas...
I use South for a production server with a codebase of ~40K lines and we have had no problems so far. We have also been through a couple of major refactors for some of our models and we have had zero problems.
One thing that we also have is version control on our models, which helps us revert any changes we make to models on the software side, with South being more for the actual data. We use Django Reversion.
I have sometimes taken an unconventional approach to this problem (reading the other answers, perhaps it's not that unconventional). I had never tried it with Django, so I just did some experiments with it.
In short, I let the code catch the exception resulting from the old schema and apply the appropriate schema upgrade. I don't expect this to be the accepted answer - it is only appropriate in some cases (and some might argue never). But I think it has an ugly-duckling elegance.
Of course, I have a test environment which I can reset back to the production state at any point. Using that test environment, I update my schema and write code against it - as usual.
I then revert the schema change and test the new code again. I catch the resulting errors, perform the schema upgrade and then re-try the erring query.
The upgrade function must be written so it will "do no harm": if it's called multiple times (as may happen when put into production), it only acts once.
Actual Python code - I put this at the end of my settings.py to test the concept, but you would probably want to keep it in a separate module:
from django.db.models.sql.compiler import SQLCompiler
from MySQLdb import OperationalError

orig_exec = SQLCompiler.execute_sql

def new_exec(self, *args, **kw):
    try:
        return orig_exec(self, *args, **kw)
    except OperationalError, e:
        if e[0] != 1054: # unknown column
            raise
        upgradeSchema(self.connection)
        return orig_exec(self, *args, **kw)

SQLCompiler.execute_sql = new_exec

def upgradeSchema(conn):
    cursor = conn.cursor()
    try:
        cursor.execute("alter table users add phone varchar(255)")
    except OperationalError, e:
        if e[0] != 1060: # duplicate column name
            raise
Once your production environment is up to date, you are free to remove this self-upgrade code from your codebase. But even if you don't, the code isn't doing any significant unnecessary work.
You would need to tailor the exception class (MySQLdb.OperationalError in my case) and numbers (1054 "unknown column" / 1060 "duplicate column" in my case) to your database engine and schema change, but that should be easy.
You might want to add some additional checks to ensure that the SQL being executed is actually failing because of the schema change in question rather than some other problem, but even if you don't, this should re-raise unrelated exceptions. The only penalty is that, in that situation, you'd be trying the upgrade and running the bad query twice before raising the exception.
One of my favorite things about Python is the ability to easily override system methods at run-time like this. It provides so much flexibility.
If your database is non-trivial and is PostgreSQL, you have a whole bunch of excellent options SQL-wise, including:
snapshotting and rollback
live replication to a backup server
trial upgrade then live
The trial upgrade option is nice (but best done in collaboration with a snapshot)
su postgres
pg_dump <db> > $(date "+%Y%m%d_%H%M").sql
psql template1
# create database upgradetest template current_db;
# \c upgradetest
# \i upgrade_file.sql
...assuming all well...
# \q
pg_dump <db> > $(date "+%Y%m%d_%H%M").sql # we're paranoid
psql <db>
# \i upgrade_file.sql
If you like the above arrangement but are worried about the time it takes to run the upgrade twice, you can lock the db for writes, and then, if the upgrade to upgradetest goes well, rename db to dbold and upgradetest to db. There are lots of options.
If you have an SQL file listing all the changes you want to make, an extremely handy psql command is \set ON_ERROR_STOP 1. This stops the upgrade script in its tracks the moment something goes wrong. And, with lots of testing, you can make sure nothing does.
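A sketch of what the top of such an upgrade file might look like; the ALTER is just a stand-in, and wrapping everything in one transaction means a failure leaves the database untouched:

\set ON_ERROR_STOP 1
BEGIN;
alter table users add column phone varchar(255);
-- ... the rest of the changes ...
COMMIT;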
There are a whole host of database schema diffing tools available, with a number noted in this StackOverflow answer. But it is basically pretty easy to do by hand ...
pg_dump --schema-only production_db > production_schema.sql
pg_dump --schema-only upgraded_db > upgrade_schema.sql
vimdiff production_schema.sql upgrade_schema.sql
or
diff -Naur production_schema.sql upgrade_schema.sql > changes.patch
vim changes.patch (to check/edit)
South isn't used everywhere. For instance, in my organization we have 3 levels of code environments: one is the local dev environment, one is the staging environment, and the third is production.
Local dev is in the developer's hands, where they can play according to their needs. Then comes staging, which is kept identical to production, of course, until a DB change has to be made on the live site: we make the DB changes on staging first, check that everything is working fine, and then manually change the production DB, making it identical to staging again.
If it's not trivial, you should have a pre-prod database/app that mimics the production one, to avoid downtime on production.