Transforming a Pandas DataFrame into a List of DataFrames

I have data that looks like this:
1.00 1.00 1.00
3.23 4.23 0.33
1.23 0.13 3.44
4.55 12.3 14.1
2.00 2.00 2.00
1.21 1.11 1.11
3.55 5.44 5.22
4.11 1.00 4.00
The data comes in chunks of 4 lines: the first line of each chunk is the index and the remaining 3 lines are the values.
A chunk is always 4 lines, but the number of columns can be more than 3.
For example:
1.00 1.00 1.00 <- 1st chunk, the index = 1
3.23 4.23 0.33 <- values
1.23 0.13 3.44 <- values
4.55 12.3 14.1 <- values
My example above only contains 2 chunks, but the real data can contain many more.
What I want to do is to create a dictionary of data frames so I can process them
chunk by chunk. Namely from this:
In [1]: import pandas as pd
In [2]: df = pd.read_table("http://dpaste.com/29R0BSS.txt",header=None, sep = " ")
In [3]: df
Out[3]:
0 1 2
0 1.00 1.00 1.00
1 3.23 4.23 0.33
2 1.23 0.13 3.44
3 4.55 12.30 14.10
4 2.00 2.00 2.00
5 1.21 1.11 1.11
6 3.55 5.44 5.22
7 4.11 1.00 4.00
Into a dictionary of data frames, such that I can do something like this (written out by hand here):
>>> # Let's call the new data frame `nd`.
>>> nd[1]
0 1 2
0 3.23 4.23 0.33
1 1.23 0.13 3.44
2 4.55 12.30 14.10

There are lots of ways to do this; I tend to use groupby, e.g. something like
>>> import numpy as np
>>> grouped = df.groupby(np.arange(len(df)) // 4)  # label every 4 consecutive rows
>>> d = {v.iloc[0, 0]: v.iloc[1:].reset_index(drop=True) for _, v in grouped}
>>> for k, v in d.items():
...     print(k)
...     print(v)
...
1.0
0 1 2
0 3.23 4.23 0.33
1 1.23 0.13 3.44
2 4.55 12.30 14.10
2.0
0 1 2
0 1.21 1.11 1.11
1 3.55 5.44 5.22
2 4.11 1.00 4.00
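If you'd rather split positionally, here is a minimal sketch of the same result with np.array_split, assuming len(df) is an exact multiple of 4 (note the keys come out as floats, so the first chunk is nd[1.0]):
>>> import numpy as np
>>> chunks = np.array_split(df, len(df) // 4)  # list of 4-row DataFrames
>>> nd = {c.iloc[0, 0]: c.iloc[1:].reset_index(drop=True) for c in chunks}
>>> nd[1.0]  # same first chunk as shown above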

Related

Pandas: Conditionally dropping columns based on same values throughout the column in MultiIndex dataframe

I have a dataframe as below:
import pandas as pd

data = {('5105', 'Open'): [1.99, 1.98, 1.99, 2.05, 2.15],
        ('5105', 'Adj Close'): [1.92, 1.92, 1.96, 2.07, 2.08],
        ('5229', 'Open'): [0.01] * 5,
        ('5229', 'Adj Close'): [0.02] * 5,
        ('7076', 'Open'): [1.02, 1.01, 1.01, 1.06, 1.06],
        ('7076', 'Adj Close'): [0.90, 0.92, 0.94, 0.94, 0.95]}
df = pd.DataFrame(data)
5105 5229 7076
Open Adj Close Open Adj Close Open Adj Close
0 1.99 1.92 0.01 0.02 1.02 0.90
1 1.98 1.92 0.01 0.02 1.01 0.92
2 1.99 1.96 0.01 0.02 1.01 0.94
3 2.05 2.07 0.01 0.02 1.06 0.94
4 2.15 2.08 0.01 0.02 1.06 0.95
As shown above, both the Open and Adj Close columns of df['5229'] hold a single repeated value all the way down. I intend to drop it, since it will not be useful in my analysis.
I have two queries:
How do I drop a level-0 column (that is, a whole ticker) if all of its subcolumns hold constant values?
On the other hand, if just one subcolumn holds a constant value, how can I drop only that subcolumn?
As this is conditional dropping, I was wondering whether df.drop still works in this case.
Based on my 1st and 2nd queries, in my case above, since both Open and Adj Close hold constant values, I would like to drop the '5229' column entirely.
The expected output is:
5105 7076
Open Adj Close Open Adj Close
0 1.99 1.92 1.02 0.90
1 1.98 1.92 1.01 0.92
2 1.99 1.96 1.01 0.94
3 2.05 2.07 1.06 0.94
4 2.15 2.08 1.06 0.95
Edit
Thank you to everyone answering the question. To be more concise: I was trying to drop columns from a dataframe of more than 200 columns, given the condition that all the values in a particular column are the same.
Try with nunique:
df = df.loc[:, ~(df.nunique() == 1).values]
Out[125]:
5105 7076
Open Adj Close Open Adj Close
0 1.99 1.92 1.02 0.90
1 1.98 1.92 1.01 0.92
2 1.99 1.96 1.01 0.94
3 2.05 2.07 1.06 0.94
4 2.15 2.08 1.06 0.95
You can try this:
for a, b in list(df.columns):  # iterate over a copy so columns can be dropped in place
    if df[(a, b)].duplicated(keep=False).sum() == df[(a, b)].size:
        df.drop((a, b), axis=1, inplace=True)
Result:
5105 7076
Open Adj Close Open Adj Close
0 1.99 1.92 1.02 0.90
1 1.98 1.92 1.01 0.92
2 1.99 1.96 1.01 0.94
3 2.05 2.07 1.06 0.94
4 2.15 2.08 1.06 0.95
Try this (note that it hardcodes the '5229' label rather than testing the condition):
df.drop('5229', level=0, axis=1)
Output:
5105 7076
Open Adj Close Open Adj Close
0 1.99 1.92 1.02 0.90
1 1.98 1.92 1.01 0.92
2 1.99 1.96 1.01 0.94
3 2.05 2.07 1.06 0.94
4 2.15 2.08 1.06 0.95
We could use unstack + groupby + nunique to count the unique values in each column, then keep only the columns with more than one unique value via loc:
out = df[df.unstack().groupby(level=[0, 1]).nunique().loc[lambda x: x != 1].index]
Output:
5105 7076
Adj Close Open Adj Close Open
0 1.92 1.99 0.90 1.02
1 1.92 1.98 0.92 1.01
2 1.96 1.99 0.94 1.01
3 2.07 2.05 0.94 1.06
4 2.08 2.15 0.95 1.06
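None of the snippets above makes the level-0 drop itself conditional, which was the first query. Here is a minimal sketch of that, assuming df is the MultiIndex frame built at the top of the question: collect the level-0 labels whose subcolumns are all constant, then drop them in one call.
# number of unique values per (ticker, field) column
nuniq = df.nunique()

# level-0 labels whose subcolumns are ALL constant
to_drop = [t for t in df.columns.get_level_values(0).unique()
           if (nuniq[t] == 1).all()]

df = df.drop(columns=to_drop, level=0)  # drops '5229' here
For the second query (a single constant subcolumn), the nunique approach above already drops subcolumns individually.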

How to fill missing timestamps in pandas

I have a CSV file as below:
t dd hh v.amm v.alc v.no2 v.cmo aqi
0 201811170000 17 0 0.40 0.41 1.33 1.55 2.45
1 201811170002 17 0 0.40 0.41 1.34 1.51 2.46
2 201811170007 17 0 0.40 0.37 1.35 1.45 2.40
Now I have to fill in the missing minutes by last observation carried forward. Expected output:
t dd hh v.amm v.alc v.no2 v.cmo aqi
0 201811170000 17 0 0.40 0.41 1.33 1.55 2.45
1 201811170001 17 0 0.40 0.41 1.33 1.55 2.45
2 201811170002 17 0 0.40 0.41 1.34 1.51 2.46
3 201811170003 17 0 0.40 0.41 1.34 1.51 2.46
4 201811170004 17 0 0.40 0.41 1.34 1.51 2.46
5 201811170005 17 0 0.40 0.41 1.34 1.51 2.46
6 201811170006 17 0 0.40 0.41 1.34 1.51 2.46
7 201811170007 17 0 0.40 0.37 1.35 1.45 2.40
I tried following this link but was unable to achieve the expected output. Sorry, I'm new to coding.
First create a DatetimeIndex with to_datetime and DataFrame.set_index, then change the frequency with DataFrame.asfreq:
df['t'] = pd.to_datetime(df['t'], format='%Y%m%d%H%M')
df = df.set_index('t').sort_index().asfreq('Min', method='ffill')
print(df)
dd hh v.amm v.alc v.no2 v.cmo aqi
t
2018-11-17 00:00:00 17 0 0.4 0.41 1.33 1.55 2.45
2018-11-17 00:01:00 17 0 0.4 0.41 1.33 1.55 2.45
2018-11-17 00:02:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:03:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:04:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:05:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:06:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:07:00 17 0 0.4 0.37 1.35 1.45 2.40
Or use DataFrame.resample with Resampler.ffill:
df['t'] = pd.to_datetime(df['t'], format='%Y%m%d%H%M')
df = df.set_index('t').sort_index().resample('Min').ffill()
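The expected output keeps t in the compact 201811170000 form. If that format matters, here is a small follow-up sketch, assuming df is the result of either snippet above:
out = df.reset_index()
out['t'] = out['t'].dt.strftime('%Y%m%d%H%M')  # back to the original compact form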

Re-shaping Dataframe so that Column Headers are made into Rows

I am trying to reshape the dataframe below.
Tenor 2013M06D12 2013M06D13 2013M06D14
1 1 1.24 1.26 1.23
4 2 2.01 0.43 0.45
5 3 1.21 2.24 1.03
8 4 0.39 2.32 1.23
So that it looks as follows. I was looking at using pivot_table, but this is sort of the opposite of what that would do, as I need to convert column headers into rows and not the other way around. Hence, I am not sure how to proceed to obtain this dataframe.
Date Tenor Rate
1 2013-06-12 1 1.24
2 2013-06-13 1 1.26
4 2013-06-14 1 1.23
The code just involves reading from a CSV:
result = pd.read_csv("BankofEngland.csv")
I think you can do this with a melt, a sort, a date parse, and some column shuffling:
dfm = pd.melt(df, id_vars="Tenor", var_name="Date", value_name="Rate")
dfm = dfm.sort_values("Tenor").reset_index(drop=True)
dfm["Date"] = pd.to_datetime(dfm["Date"], format="%YM%mD%d")
dfm = dfm[["Date", "Tenor", "Rate"]]
produces
In [104]: dfm
Out[104]:
Date Tenor Rate
0 2013-06-12 1 1.24
1 2013-06-13 1 1.26
2 2013-06-14 1 1.23
3 2013-06-12 2 2.01
4 2013-06-13 2 0.43
5 2013-06-14 2 0.45
6 2013-06-12 3 1.21
7 2013-06-13 3 2.24
8 2013-06-14 3 1.03
9 2013-06-12 4 0.39
10 2013-06-13 4 2.32
11 2013-06-14 4 1.23
import pandas as pd
import numpy as np
# try to read your sample data, replace with your read_csv func
df = pd.read_clipboard()
Out[139]:
Tenor 2013M06D12 2013M06D13 2013M06D14
1 1 1.24 1.26 1.23
4 2 2.01 0.43 0.45
5 3 1.21 2.24 1.03
8 4 0.39 2.32 1.23
# reshaping
df.set_index('Tenor', inplace=True)
df = df.stack().reset_index()
df.columns=['Tenor', 'Date', 'Rate']
# suggested by DSM, use the date parser
df.Date = pd.to_datetime(df.Date, format='%YM%mD%d')
Out[147]:
Tenor Date Rate
0 1 2013-06-12 1.24
1 1 2013-06-13 1.26
2 1 2013-06-14 1.23
3 2 2013-06-12 2.01
4 2 2013-06-13 0.43
.. ... ... ...
7 3 2013-06-13 2.24
8 3 2013-06-14 1.03
9 4 2013-06-12 0.39
10 4 2013-06-13 2.32
11 4 2013-06-14 1.23
[12 rows x 3 columns]
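Since the question notes that pivot_table is "sort of the opposite" of what is needed, it may help to see that the two operations really are inverses. A quick sketch using the melted frame dfm from the first answer:
# rebuild the original wide table: one row per Tenor, one column per Date
wide = dfm.pivot(index="Tenor", columns="Date", values="Rate")
print(wide)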

transpose multiple columns Pandas dataframe

I'm trying to reshape a dataframe, but I'm not able to get the results I need.
The dataframe looks like this:
m r s p O W N
1 4 3 1 2.81 3.70 3.03
1 4 4 1 2.14 2.82 2.31
1 4 5 1 1.47 1.94 1.59
1 4 3 2 0.58 0.78 0.60
1 4 4 2 0.67 0.00 0.00
1 4 5 2 1.03 2.45 1.68
1 4 3 3 1.98 1.34 1.81
1 4 4 3 0.00 0.04 0.15
1 4 5 3 0.01 0.00 0.26
I need to reshape the dataframe so it will look like this:
m r s p O W N p O W N p O W N
1 4 3 1 2.81 3.70 3.03 2 0.58 0.78 0.60 3 1.98 1.34 1.81
1 4 4 1 2.14 2.82 2.31 2 0.67 0.00 0.00 3 0.00 0.04 0.15
1 4 5 1 1.47 1.94 1.59 2 1.03 2.45 1.68 3 0.01 0.00 0.26
I tried to use the pivot_table function
df.pivot_table(index=['m','r','s'], columns=['p'], values=['O','W','N'])
but I'm not able to get quite what I want. Does anyone know how to do this?
Even as someone who fancies himself pretty handy with pandas, I find the pivot_table and melt functions confusing. I prefer to stick with a well-defined and unique index and use the stack and unstack methods of the dataframe itself.
First, I'll ask if you really need to repeat the p-column like that? I can sort of see its value when presenting data, but IMO pandas isn't really set up to work like that. We could shoehorn it in, but let's see if a simpler solution gets you what you need.
Here's what I would do:
from io import StringIO
import pandas
datatable = StringIO("""\
m r s p O W N
1 4 3 1 2.81 3.70 3.03
1 4 4 1 2.14 2.82 2.31
1 4 5 1 1.47 1.94 1.59
1 4 3 2 0.58 0.78 0.60
1 4 4 2 0.67 0.00 0.00
1 4 5 2 1.03 2.45 1.68
1 4 3 3 1.98 1.34 1.81
1 4 4 3 0.00 0.04 0.15
1 4 5 3 0.01 0.00 0.26""")
df = (
    pandas.read_table(datatable, sep=r'\s+')
    .set_index(['m', 'r', 's', 'p'])
    .unstack(level='p')
)
df.columns = df.columns.swaplevel(0, 1)
df.sort_index(axis=1, inplace=True)
print(df)
Which prints:
p 1 2 3
O W N O W N O W N
m r s
1 4 3 2.81 3.70 3.03 0.58 0.78 0.60 1.98 1.34 1.81
4 2.14 2.82 2.31 0.67 0.00 0.00 0.00 0.04 0.15
5 1.47 1.94 1.59 1.03 2.45 1.68 0.01 0.00 0.26
So now the columns are a MultiIndex and you can access, for example, all of the values where p = 2 with df[2] or df.xs(2, level='p', axis=1), which gives me:
O W N
m r s
1 4 3 0.58 0.78 0.60
4 0.67 0.00 0.00
5 1.03 2.45 1.68
Similarly, you can get all of the W columns with df.xs('W', level=1, axis=1) (we say level=1 because that column level does not have a name, so we use its position instead):
p 1 2 3
m r s
1 4 3 3.70 0.78 1.34
4 2.82 0.00 0.04
5 1.94 2.45 0.00
You can similarly query the rows by using axis=0.
If you really need the p values in a column, just add it there manually and reindex your columns:
for p in df.columns.get_level_values('p').unique():
    df[p, 'p'] = p

cols = pandas.MultiIndex.from_product([[1, 2, 3], list('pOWN')])
df = df.reindex(columns=cols)
print(df)
1 2 3
p O W N p O W N p O W N
m r s
1 4 3 1 2.81 3.70 3.03 2 0.58 0.78 0.60 3 1.98 1.34 1.81
4 1 2.14 2.82 2.31 2 0.67 0.00 0.00 3 0.00 0.04 0.15
5 1 1.47 1.94 1.59 2 1.03 2.45 1.68 3 0.01 0.00 0.26
An alternative, lower-level take: read the first column of the CSV and write it back out as a single comma-joined row.
import csv

sk = ''
with open('df_col2.csv', 'r') as ann:
    for col in ann:
        an = col.lower().strip('\n').split(',')
        sk += an[0] + ','
sk = sk[:-1]  # drop the trailing comma

with open('ss2.csv', 'w') as b:
    a = csv.writer(b)
    a.writerow([sk])

Calculate Euclidean distance between pairs of 20 amino acids using their physicochemical property scores in Python, with a matrix as the output

Can someone help me?
I have 20 amino acids (AAs) and 7 physicochemical properties (RADA880102; FAUJ880103; ZIMJ680104; GRAR740102; CRAJ730103; BURA740101; CHAM820102).
The input is a tab delimited text file and it looks like this:
Amino-acids A R N D C Q E G H I L K M F P S T W Y V
RADA880102 0.52 -1.32 -0.01 0 0 -0.07 -0.79 0 0.95 2.04 1.76 0.08 1.32 2.09 0 0.04 0.27 2.51 1.63 1.18
FAUJ880103 1 6.13 2.95 2.78 2.43 3.95 3.78 0 4.66 4 4 4.77 4.43 5.89 2.72 1.6 2.6 8.08 6.47 3
ZIMJ680104 6 10.76 5.41 2.77 5.05 5.65 3.22 5.97 7.59 6.02 5.98 9.74 5.74 5.48 6.3 5.68 5.66 5.89 5.66 5.96
GRAR740102 8.1 10.5 11.6 13 5.5 10.5 12.3 9 10.4 5.2 4.9 11.3 5.7 5.2 8 9.2 8.6 5.4 6.2 5.9
CRAJ730103 0.6 0.79 1.42 1.24 1.29 0.92 0.64 1.38 0.95 0.67 0.7 1.1 0.67 1.05 1.47 1.26 1.05 1.23 1.35 0.48
BURA740101 0.486 0.262 0.193 0.288 0.2 0.418 0.538 0.12 0.4 0.37 0.42 0.402 0.417 0.318 0.208 0.2 0.272 0.462 0.161 0.379
CHAM820102 -0.368 -1.03 0 2.06 4.53 0.731 1.77 -0.525 0 0.791 1.07 0 0.656 1.06 -2.24 -0.524 0 1.6 4.91 0.401
I am trying to write a script in Python to compute the Euclidean distance for each pair of AAs using the following formula:
dist = sqrt((xa - xb)^2 + (ya - yb)^2 + (za - zb)^2 + (ma - mb)^2 + (na - nb)^2 + (pa - pb)^2 + (ra - rb)^2)
where (xa, ya, za, ma, na, pa, ra) are the seven physicochemical property values of the original AA and (xb, yb, zb, mb, nb, pb, rb) are those of the substituting AA.
For instance, the Euclidean distance between the two AAs A and R would look like this:
dist = sqrt((0.52 - (-1.32))^2 + (1 - 6.13)^2 + (6 - 10.76)^2 + (8.1 - 10.5)^2 + (0.6 - 0.79)^2 + (0.486 - 0.262)^2 + (-0.368 - (-1.03))^2)
The original formula can be found on page 2 at this link:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3589708/pdf/fgene-04-00021.pdf
I would like my script to return a matrix of Euclidean distances as the output, with 380 values:
A R N D C Q E G H I L K M F P S T W Y V
A
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
Thank you for any help
Based on your comment, the part you are stuck on is creating a matrix to return from a function:
def zeros_matrix(width, height):
    return [[0 for _ in range(width)] for _ in range(height)]
will return a matrix of all zeros. Further, since I doubt that is the only place you are stuck, to open the file and read it into a matrix (skipping the header line of amino-acid labels and converting the values to floats):
with open("some_file") as f:
    next(f)  # skip the header line
    matrix = [[float(x) for x in line.split()[1:]] for line in f]  # [1:] drops the row label
print(matrix)
To calculate the distance between 2 rows:
import math

def row_dist(row1, row2):
    # square root of the summed squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row1, row2)))

print(row_dist(matrix[0], matrix[1]))
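Note that the matrix read above has one row per property, while the question asks for distances between amino acids, i.e. between columns. Here is a minimal sketch of the full task with pandas and scipy, assuming the tab-delimited table is saved as aaindex.txt (a hypothetical filename):
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# rows are the 7 properties, columns are the 20 amino acids
props = pd.read_table("aaindex.txt", index_col=0)

# pdist computes Euclidean distances between rows, so transpose first:
# one row of 7 property values per amino acid
dist = squareform(pdist(props.T, metric="euclidean"))
dist_df = pd.DataFrame(dist, index=props.columns, columns=props.columns)

print(dist_df.round(2))
print(dist_df.loc['A', 'R'])  # ~7.66, matching the worked A-R example above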
