how to flip columns in reverse order using shell scripting/python - python

Dear experts, I have a small problem: I just want to reverse the columns. For example, I have a data set arranged in columns and I need to put the last column first, and so on in reverse. How can this be done? I hope some expert will answer my question. Thanks
Input data example:
1 2 3 4 5
6 7 8 9 0
3 4 5 2 1
5 6 7 2 3
I need output like below:
5 4 3 2 1
0 9 8 7 6
1 2 5 4 3
3 2 7 6 5

Perl to the rescue!
perl -lane 'print join " ", reverse @F' < input-file
-n reads the file line by line, running the code specified after -e for each line
-l removes newlines from input and adds them to output
-a splits the input line on whitespace, populating the @F array
reverse reverses the array
join turns a list to a string

What is the type of your data? In Python, if your data is a NumPy array then just do data[:, ::-1]. It also works for lists (but only for the first dimension, obviously). In fact it is the general behaviour of Python slicing (start, stop, stride) with start and stop omitted; it works with any object supporting indexing.
But if this is the only data manipulation you have to do, it may be overkill to use Python for it. However, it may be more efficient than raw string manipulation (using Perl or whatever) depending on the size of your data.
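For completeness, here is a minimal NumPy sketch of that approach (the file names input-file and output-file are my assumption, matching the Perl example above):
import numpy as np
# Load the whitespace-delimited table, reverse the column order, write it back out.
data = np.loadtxt('input-file')
np.savetxt('output-file', data[:, ::-1], fmt='%g')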

Related

Cluster similar - but not identical - digits in pandas dataframe

I have a pandas dataframe with 2M+ rows. One of the columns, pin, contains a series of 14 digits.
I'm trying to cluster similar — but not identical — digits. Specifically, I want to match the first 10 digits without regard to the final four. The pin column was imported as an int then converted to a string.
Put another way, the first 10 digits should match but the final four shouldn't. Duplicates of exact-matching pins should be dropped.
For example these should all be grouped together:
17101110141403
17101110141892
17101110141763
17101110141199
17101110141788
17101110141851
17101110141831
17101110141487
17101110141914
17101110141843
Desired output:
Biggest cluster | other columns
Second biggest cluster | other columns
...and so on | other columns
I've tried using a combination of groupby and regex without success.
pat2 = '1710111014\d\d\d\d'
pat = '\d\d\d\d\d\d\d\d\d\d\d\d\d\d'
grouped = df2.groupby(df2['pin'].str.extract(pat, expand=False), axis= 1)
and
df.groupby(['pin']).filter(lambda group: re.match > 1)
Here's a link to the original data set: https://datacatalog.cookcountyil.gov/Property-Taxation/Assessor-Parcel-Sales/wvhk-k5uv
It's not clear why you need regex for this; what about the following (assuming pin is stored as a string)? (Note that you haven't included your expected output.)
pin
0 17101110141403
1 17101110141892
2 17101110141763
3 17101110141199
4 17101110141788
5 17101110141851
6 17101110141831
7 17101110141487
8 17101110141914
9 17101110141843
df.groupby(df['pin'].str[:10]).size()
pin
1710111014 10
dtype: int64
If you want this information appended back to your original dataframe, you can use
df['size'] = df.groupby(df['pin'].astype(str).str[:10])['pin'].transform(len)
pin size
0 17101110141403 10
1 17101110141892 10
2 17101110141763 10
3 17101110141199 10
4 17101110141788 10
5 17101110141851 10
6 17101110141831 10
7 17101110141487 10
8 17101110141914 10
9 17101110141843 10
Then, assuming you have more columns, you can sort your dataframe by cluster size (biggest first) with
df.sort_values('size', ascending=False)
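Pulling the pieces together, a minimal sketch of the whole flow on a toy frame (the price column and the values are made up, not taken from the linked Cook County data):
import pandas as pd
df = pd.DataFrame({'pin': ['17101110141403', '17101110141892', '22222222220001'],
                   'price': [100, 200, 300]})
# Drop exact-duplicate pins, then count how many pins share each 10-digit prefix.
df = df.drop_duplicates(subset='pin')
df['size'] = df.groupby(df['pin'].str[:10])['pin'].transform(len)
# Biggest clusters first.
print(df.sort_values('size', ascending=False))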

Pandas Vectorization with Function on Parts of Column

So I have a dataframe that looks something like this:
df1 = pd.DataFrame([[1,2, 3], [5,7,8], [2,5,4]])
0 1 2
0 1 2 3
1 5 7 8
2 2 5 4
I then have a function called add5 that adds 5 to a number. I'm trying to create a new column in df1 that adds 5 to all the numbers in column 2 that are greater than 3. I want to use vectorization, not apply, since this concept is going to be expanded to a dataset with hundreds of thousands of entries and speed will be important. I can do it without the greater-than-3 constraint like this:
df1['3'] = add5(df1[2])
But my goal is to do something like this:
df1['3'] = add5(df1[2]) if df1[2] > 3
Hoping someone can point me in the right direction on this. Thanks!
With Pandas, a function applied explicitly to each row typically cannot be vectorised. Even implicit loops such as pd.Series.apply will likely be inefficient. Instead, you should use true vectorised operations, which lean heavily on NumPy in both functionality and syntax.
In this case, you can use numpy.where:
df1[3] = np.where(df1[2] > 3, df1[2] + 5, df1[2])
Alternatively, you can use pd.DataFrame.loc in a couple of steps:
df1[3] = df1[2]
df1.loc[df1[2] > 3, 3] = df1[2] + 5
In each case, the term df1[2] > 3 creates a Boolean series, which is then used to mask another series.
Result:
print(df1)
0 1 2 3
0 1 2 3 3
1 5 7 8 13
2 2 5 4 9
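If you want to keep the add5 helper from the question in the picture, here is a hedged variant of the same idea (assuming add5 is a simple element-wise function, so it stays vectorised):
import numpy as np
import pandas as pd

def add5(x):
    # Works element-wise on a whole Series, so no Python-level loop is needed.
    return x + 5

df1 = pd.DataFrame([[1, 2, 3], [5, 7, 8], [2, 5, 4]])
# Apply add5 only where column 2 is greater than 3; keep the original value otherwise.
df1[3] = np.where(df1[2] > 3, add5(df1[2]), df1[2])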

Creating columns with numpy Python

I have some elements stored in a numpy array. I wish to store them in a ".txt" file. It needs to fit a certain standard, which means each element needs to start at a fixed column position in the file.
Example:
numpy.array[0] needs to start in line 1, col 26.
numpy.array[1] needs to start in line 1, col 34.
I use numpy.savetxt() to save the arrays to file.
Later I will implement this in a loop to create a large ".txt" file with coordinates.
Edit: This good example was provided below; it points out my struggle:
In [117]: np.savetxt('test.txt',A.T,'%20d %10d')
In [118]: cat test.txt
                   0          6
                   1          7
                   2          8
                   3          9
                   4         10
                   5         11
The fmt option '%20d %10d' gives you spacing which depends on the last integer. What I need is an option which lets me set the spacing from the left side regardless of the other integers.
Template I need to fit the numbers into:
XXXXXXXX.XXX YYYYYYY.YYY ZZZZ.ZZZ
Final Edit:
I solved it by creating a test which checks how many spaces the last float used. I was then able to predict the number of spaces the next float needed to fit the template.
Have you played with the fmt of np.savetxt?
Let me illustrate with a concrete example (the sort that you should have given us)
Make a 2 row array:
In [111]: A=np.arange((12)).reshape(2,6)
In [112]: A
Out[112]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11]])
Save it, and get 2 rows, 6 columns
In [113]: np.savetxt('test.txt',A,'%d')
In [114]: cat test.txt
0 1 2 3 4 5
6 7 8 9 10 11
save its transpose, and get 6 rows, 2 columns
In [115]: np.savetxt('test.txt',A.T,'%d')
In [116]: cat test.txt
0 6
1 7
2 8
3 9
4 10
5 11
Put more detail into fmt to space out the columns
In [117]: np.savetxt('test.txt',A.T,'%20d %10d')
In [118]: cat test.txt
                   0          6
                   1          7
                   2          8
                   3          9
                   4         10
                   5         11
I think you can figure out how to make a fmt string that puts your numbers in the correct columns (join 26 spaces etc, or use left and right justification - the usual Python formatting issues).
savetxt also takes an opened file. So you can open a file for writing, write one array, add some filler lines, and write another. Also, savetxt doesn't do anything fancy. It just iterates through the rows of the array, and writes each row to a line, e.g.
for row in A:
    file.write(fmt % tuple(row) + '\n')  # one formatted line per row
So if you don't like the control that savetxt gives you, write the file directly.
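As a sketch of what such a fmt string could look like for the XXXXXXXX.XXX YYYYYYY.YYY ZZZZ.ZZZ template (the field widths 12, 11 and 8 are my reading of the template, not something stated in the question):
import numpy as np
coords = np.array([[12345678.123, 1234567.123, 1234.123],
                   [1.5, 2.25, 3.0]])
# Right-justified, fixed-width floats: 12, 11 and 8 characters wide, 3 decimals each.
np.savetxt('coords.txt', coords, fmt='%12.3f %11.3f %8.3f')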

counting T/F values for several conditions

I am a beginner using pandas.
I'm looking for mutations in several patients. I have 16 different conditions. I have written code for it, but how can I do this with a for loop? I try to find the changes in the MUT column, set them as True and False, and then count the True/False numbers. I have done it for only 4.
Can you suggest a more simple way, instead of writing the same code 16 times?
s1=df["MUT"]
A_T= s1.str.contains("A:T")
ATnum= A_T.value_counts(sort=True)
s2=df["MUT"]
A_G=s2.str.contains("A:G")
AGnum=A_G.value_counts(sort=True)
s3=df["MUT"]
A_C=s3.str.contains("A:C")
ACnum=A_C.value_counts(sort=True)
s4=df["MUT"]
A__=s4.str.contains("A:-")
A_num=A__.value_counts(sort=True)
I'm not an expert with Pandas, so I don't know if there's a cleaner way of doing this, but perhaps the following might work?
chars = 'TGC-'
nums = {}
for char in chars:
    s = df["MUT"]
    A = s.str.contains("A:" + char)
    num = A.value_counts(sort=True)
    nums[char] = num
ATnum = nums['T']
AGnum = nums['G']
# ...etc
Basically, go through each unique character (T, G, C, -) then pull out the values that you need, then finally stick the numbers in a dictionary. Then, once the loop is finished, you can fetch whatever numbers you need back out of the dictionary.
Just use value_counts; this will give you a count of all unique values in your column, no need to create 16 variables:
In [5]:
df = pd.DataFrame({'MUT':np.random.randint(0,16,100)})
df['MUT'].value_counts()
Out[5]:
6 11
14 10
13 9
12 9
1 8
9 7
15 6
11 6
8 5
5 5
3 5
2 5
10 4
4 4
7 3
0 3
dtype: int64
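Since the MUT column in this question holds strings such as "A:T" rather than bare values, you could combine the two ideas; here is a minimal sketch (the example values are made up):
import pandas as pd
df = pd.DataFrame({'MUT': ['A:T', 'A:G', 'A:T', 'A:-', 'A:C', 'A:G']})
# Pull the 'A:<base>' pattern out of each entry and count every variant in one pass.
counts = df['MUT'].str.extract(r'(A:[TGC-])', expand=False).value_counts()
print(counts)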

Least writes sort for a linked list python

I have a linked list: 1 3 7 4 5 0 2 6. I wish to sort it in the least number of moves/writes.
I can only move elements of that list using a function insertBetween(element, after, before) which simply inserts element in between the after and before elements.
For example, the first run might want to move 1 after 0 and before 2:
insertBetween(1, 0, 2)
The list will now be 3 7 4 5 0 1 2 6.
Now it might want to move 7 to the end: insertBetween(7, 6, None)
The list will now be 3 4 5 0 1 2 6 7.
Now it might want to move 0 to the start: insertBetween(0, None, 3)
The list will now be 0 3 4 5 1 2 6 7.
The only priority for this sorting algorithm is the least number of uses of the function insertBetween(element, after, before), since using it is extremely expensive. I wish to implement it in Python.
Don't focus on the elements you move. They can be moved anywhere and are not the problem. Instead, focus on the elements you don't move. Those need to be already sorted. So if you have N elements and a longest sorted subsequence with length L, you just need N-L moves to move the N-L elements not in that subsequence, and you can't do better. Finding the longest sorted subsequence is a standard problem, but look here if you don't know how to do it.
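A hedged sketch of the counting part, on a plain Python list standing in for the linked list's values (the linked-list bookkeeping and the actual insertBetween calls are left out):
import bisect

def min_moves(values):
    # Minimum number of insertBetween calls = N minus the length of the
    # longest increasing subsequence of the current order.
    tails = []  # tails[k] = smallest possible tail of an increasing run of length k + 1
    for v in values:
        i = bisect.bisect_left(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(values) - len(tails)

print(min_moves([1, 3, 7, 4, 5, 0, 2, 6]))  # 3: the subsequence 1 3 4 5 6 can stay put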
