I have the following pandas dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.choice([0,1], (6,3)), columns=list('XYZ'))
X Y Z
0 1 0 1
1 1 1 0
2 0 0 0
3 0 1 1
4 0 1 1
5 1 1 1
Let's say I take the transpose and wish to save it
df = df.T
0 1 2 3 4 5
X 1 1 0 0 0 1
Y 0 1 0 1 1 1
Z 1 0 0 1 1 1
So, there are three rows. I would like to save them in this format:
X 110001
Y 010111
Z 100111
I tried
df.to_csv("filename.txt", header=None, index=None, sep='')
However, this outputs an error:
TypeError: "delimiter" must be a 1-character string
Is it possible to save the dataframe in this manner, or is there some way to combine all columns into one? What is the most "pandas" solution?
Leave original df alone. Don't transpose.
df = pd.DataFrame(np.random.choice([0,1], (6,3)), columns=list('XYZ'))
df.astype(str).apply(''.join)
X 101100
Y 101001
Z 111110
dtype: object
If you do want to transpose, then you can do something like this:
In [126]: df.T.apply(lambda row: ''.join(map(str, row)), axis=1)
Out[126]:
X 001111
Y 000000
Z 010010
dtype: object
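Either way you end up with a Series of strings, which can then be written out in the requested format: a one-character separator is legal, whereas sep='' is what triggers the TypeError. A minimal sketch (filename.txt is just a placeholder):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice([0, 1], (6, 3)), columns=list('XYZ'))

# one string of 0s and 1s per column, keyed by the column label
joined = df.astype(str).apply(''.join)

# a single-space separator produces lines like "X 110001"
joined.to_csv('filename.txt', sep=' ', header=False)
```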
Related
I have a large dataset with many columns of numeric data and want to be able to count all the zeros in each of the rows. The following will generate a small sample of the data.
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)),columns=list('abc'))
df
While I can create a column to sum all the values in the rows with the following code:
df2=df.sum(axis=1)
df2
And I can get a count of the zeros in a single column:
df.loc[df.a==0].count()
I haven't been able to figure out how to get a count of the zeros across each of the rows. Any assistance would be greatly appreciated.
To count matched values, you can sum the True values of a boolean mask.
If need new column:
df['sum of 1'] = df.eq(1).sum(axis=1)
#alternative
#df['sum of 1'] = (df == 1).sum(axis=1)
Sample:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)),columns=list('abc'))
df['sum of 1'] = df.eq(1).sum(axis=1)
print (df)
a b c sum of 1
0 0 0 2 0
1 1 0 1 2
2 0 0 0 0
3 2 1 2 1
4 2 2 1 1
5 0 0 0 0
6 0 2 0 0
7 1 1 1 3
If need new row:
df.loc['sum of 1'] = df.eq(1).sum()
#alternative
#df.loc['sum of 1'] = (df == 1).sum()
Sample:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)),columns=list('abc'))
df.loc['sum of 1'] = df.eq(1).sum()
print (df)
a b c
0 0 0 2
1 1 0 1
2 0 0 0
3 2 1 2
4 2 2 1
5 0 0 0
6 0 2 0
7 1 1 1
sum of 1 2 2 3
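The original question asked for zeros rather than ones; the same boolean-mask pattern works by comparing against 0 instead:

```python
import numpy as np
import pandas as pd

np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0, 3, size=(8, 3)), columns=list('abc'))

# eq(0) marks the zeros; summing the Trues along axis=1 counts them per row
df['zeros per row'] = df.eq(0).sum(axis=1)
print(df)
```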
df:
0 1 2
0 0.0481948 0.1054251 0.1153076
1 0.0407258 0.0890868 0.0974378
2 0.0172071 0.0376403 0.0411687
etc.
I would like to remove all values in which the x and y titles/values of the dataframe are equal, therefore, my expected output would be something like:
0 1 2
0 NaN 0.1054251 0.1153076
1 0.0407258 NaN 0.0974378
2 0.0172071 0.0376403 NaN
etc.
As shown, the values of (0,0), (1,1), (2,2) and so on, have been removed/replaced.
I thought of looping through the index as followed:
for (idx, row) in df.iterrows():
    if (row.index) == ???
But I don't know how to carry on from there, or whether it's even the right approach.
You can set the diagonal:
In [11]: df.values[np.arange(len(df)), np.arange(len(df))] = np.nan
In [12]: df
Out[12]:
0 1 2
0 NaN 0.105425 0.115308
1 0.040726 NaN 0.097438
2 0.017207 0.037640 NaN
@AndyHayden's answer is really cool and taught me something. However, it relies on positional indexing and requires the array to be square, with rows and columns in the same order.
I generalized the concept here
Consider the data frame df
df = pd.DataFrame(1, list('abcd'), list('xcya'))
df
x c y a
a 1 1 1 1
b 1 1 1 1
c 1 1 1 1
d 1 1 1 1
Then we use numpy broadcasting and np.where to perform the same fancy index assignment:
ij = np.where(df.index.values[:, None] == df.columns.values)
df.values[ij] = 0  # NumPy fancy indexing assigns at the (row, col) pairs
df
x c y a
a 1 1 1 0
b 1 1 1 1
c 1 0 1 1
d 1 1 1 1
where n is the number of rows/columns:
df.values[np.arange(n), np.arange(n)] = np.nan
or
np.fill_diagonal(df.values, np.nan)
see https://stackoverflow.com/a/24475214/
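np.fill_diagonal mutates the array in place, so one safe pattern is to fill a NumPy array first and then wrap it (writing through df.values can fail when pandas hands back a copy, and the frame must hold floats for NaN to fit):

```python
import numpy as np
import pandas as pd

arr = np.arange(9, dtype=float).reshape(3, 3)
np.fill_diagonal(arr, np.nan)  # in-place; returns None

df = pd.DataFrame(arr, columns=list('abc'))
print(df)
```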
I have a simple question which relates to similar questions here, and here.
I am trying to drop all columns from a pandas dataframe, which have only zeroes (vertically, axis=1). Let me give you an example:
df = pd.DataFrame({'a':[0,0,0,0], 'b':[0,-1,0,1]})
a b
0 0 0
1 0 -1
2 0 0
3 0 1
I'd like to drop column a since it has only zeroes.
However, I'd like to do it in a nice and vectorized fashion if possible. My data set is huge - so I don't want to loop. Hence I tried
df = df.loc[(df).any(1), (df!=0).any(0)]
b
1 -1
3 1
Which allows me to drop both columns and rows. But if I just try to drop the columns, loc seems to fail. Any ideas?
You are really close. Use any; zeros are cast to False:
df = df.loc[:, df.any()]
print (df)
b
0 0
1 1
2 0
3 1
If it's a matter of 0s and not sum, use df.any:
In [291]: df.T[df.any()].T
Out[291]:
b
0 0
1 -1
2 0
3 1
Alternatively:
In [296]: df.T[(df != 0).any()].T # or df.loc[:, (df != 0).any()]
Out[296]:
b
0 0
1 -1
2 0
3 1
In [73]: df.loc[:, df.ne(0).any()]
Out[73]:
b
0 0
1 1
2 0
3 1
or:
In [71]: df.loc[:, ~df.eq(0).all()]
Out[71]:
b
0 0
1 1
2 0
3 1
If we want to check those that do NOT sum up to 0:
In [78]: df.loc[:, df.sum().astype(bool)]
Out[78]:
b
0 0
1 1
2 0
3 1
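One caveat worth checking with the question's actual sample: column b sums to zero (0 - 1 + 0 + 1), so the sum-based mask would drop it as well. The any-based masks are the safe choice when values can be negative. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 0], 'b': [0, -1, 0, 1]})

# any() keeps b: it has nonzero entries even though they cancel out
kept_any = df.loc[:, (df != 0).any()]

# the sum-based mask drops b here, because 0 + (-1) + 0 + 1 == 0
kept_sum = df.loc[:, df.sum().astype(bool)]

print(kept_any.columns.tolist(), kept_sum.columns.tolist())
```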
I'm trying to figure out how to compare the element of the previous row of a column to a different column on the current row in a Pandas DataFrame. For example:
data = pd.DataFrame({'a':['1','1','1','1','1'],'b':['0','0','1','0','0']})
Output:
a b
0 1 0
1 1 0
2 1 1
3 1 0
4 1 0
And now I want to make a new column that asks if (data['a'] + data['b']) is greater than the previous value of that same column.
Theoretically:
data['c'] = np.where(data['a']==( the previous row value of data['a'] ),min((data['b']+( the previous row value of data['c'] )),1),data['b'])
So that I can theoretically output:
a b c
0 1 0 0
1 1 0 0
2 1 1 1
3 1 0 1
4 1 0 1
I'm wondering how to do this because I'm trying to recreate this excel conditional statement: =IF(A70=A69,MIN((P70+Q69),1),P70)
where data['a'] = column A and data['b'] = column P.
If anyone has any ideas on how to do this, I'd greatly appreciate your advice.
According to your statement, 'new column that asks if (data['a'] + data['b']) is greater than the previous value of that same column', I suggest solving it this way:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'a':[1,1,1,1,1],'b':[0,0,1,0,3]})
>>> df
a b
0 1 0
1 1 0
2 1 1
3 1 0
4 1 3
>>> df['c'] = np.where(df['a']+df['b'] > df['a'].shift(1)+df['b'].shift(1), 1, 0)
>>> df
a b c
0 1 0 0
1 1 0 0
2 1 1 1
3 1 0 0
4 1 3 1
However, this doesn't look at the 'previous value of that same column'.
If you try to use df['c'].shift(1) inside np.where(), it will raise KeyError: 'c', because the column doesn't exist yet.
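Since c on each row depends on c of the row above, the recurrence can't be expressed as a single vectorized np.where. A plain loop that mirrors =IF(A70=A69,MIN((P70+Q69),1),P70) is a reasonable fallback (numeric columns are assumed here, unlike the string columns in the question):

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 1, 1, 1, 1], 'b': [0, 0, 1, 0, 0]})

c = []
for i in range(len(data)):
    if i > 0 and data['a'].iat[i] == data['a'].iat[i - 1]:
        # same value of a as the previous row: accumulate b, capped at 1
        c.append(min(data['b'].iat[i] + c[-1], 1))
    else:
        # first row, or a changed: just take b
        c.append(data['b'].iat[i])
data['c'] = c
print(data)
```

This reproduces the desired output from the question: c comes out as 0, 0, 1, 1, 1.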
In short ... I have a Python Pandas data frame that is read in from an Excel file using 'read_table'. I would like to keep a handful of the series from the data, and purge the rest. I know that I can just delete what I don't want one-by-one using 'del data['SeriesName']', but what I'd rather do is specify what to keep instead of specifying what to delete.
If the simplest answer is to copy the existing data frame into a new data frame that only contains the series I want, and then delete the existing frame in its entirety, I would satisfied with that solution ... but if that is indeed the best way, can someone walk me through it?
TIA ... I'm a newb to Pandas. :)
You can use the DataFrame drop function to remove columns. You have to pass the axis=1 option for it to work on columns and not rows. Note that it returns a copy so you have to assign the result to a new DataFrame:
In [1]: from pandas import *
In [2]: df = DataFrame(dict(x=[0,0,1,0,1], y=[1,0,1,1,0], z=[0,0,1,0,1]))
In [3]: df
Out[3]:
x y z
0 0 1 0
1 0 0 0
2 1 1 1
3 0 1 0
4 1 0 1
In [4]: df = df.drop(['x','y'], axis=1)
In [5]: df
Out[5]:
z
0 0
1 0
2 1
3 0
4 1
Basically the same as Zelazny7's answer -- just specifying what to keep:
In [68]: df
Out[68]:
x y z
0 0 1 0
1 0 0 0
2 1 1 1
3 0 1 0
4 1 0 1
In [70]: df = df[['x','z']]
In [71]: df
Out[71]:
x z
0 0 0
1 0 0
2 1 1
3 0 0
4 1 1
*Edit*
You can specify a large number of columns through indexing/slicing into the DataFrame.columns object.
This object, of type pandas.Index, behaves like an immutable ordered sequence of column labels (with some extended functionality).
See this extension of above examples:
In [4]: df.columns
Out[4]: Index([x, y, z], dtype=object)
In [5]: df[df.columns[1:]]
Out[5]:
y z
0 1 0
1 0 0
2 1 1
3 1 0
4 0 1
In [7]: df.drop(df.columns[1:], axis=1)
Out[7]:
x
0 0
1 0
2 1
3 0
4 1
You can also specify a list of columns to keep with the usecols option in pandas.read_table. This speeds up the loading process as well.
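A small sketch of the usecols option (shown with read_csv, which takes the same parameter; the column names and the in-memory CSV are just placeholders):

```python
import io
import pandas as pd

csv_data = io.StringIO("x,y,z\n0,1,0\n1,0,1\n")

# only the named columns are parsed and kept
df = pd.read_csv(csv_data, usecols=['x', 'z'])
print(df)
```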