Drop / move header of a dataframe into first row - python

I have a dataframe that looks like this:
A B 10
0 A B 20
1 C A 10
The column labels are not the real headers of the dataframe (I have to map them from another dataframe). How can I move the current headers into the first row, so that it looks like this:
0 1 2
0 A B 10
1 A B 20
2 C A 10
Note that pd.read_csv(..., header=None) leads to an error in this case and I don't know why, so I am looking for a way to fix it after I load the file.

The best option is to avoid the problem with the header=None parameter of read_csv:
df = pd.read_csv(file, header=None)
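If that is not possible, convert the column names to a one-row DataFrame, put it in front of the original data, and then set the column names to a range:
# DataFrame.append was removed in pandas 2.0, so pd.concat is used here
df = pd.concat([df.columns.to_frame().T, df], ignore_index=True)
df.columns = range(len(df.columns))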
print (df)
0 1 2
0 A B 10
1 A B 20
2 C A 10

You can also fix it with transpose and reset_index:
df = df.T.reset_index().T
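As a rough sketch of how the transpose trick behaves, the snippet below rebuilds the frame from the question (the sample values are taken from it, the variable name fixed is made up here). One detail worth noting: after the double transpose the new first row is labelled "index", so the rows are renumbered in a second step.

```python
import pandas as pd

# Frame from the question: the column labels are really the first data row.
df = pd.DataFrame([["A", "B", 20], ["C", "A", 10]], columns=["A", "B", 10])

# Transposing turns the column labels into the index; reset_index() moves
# them into a regular column, and transposing back makes them the first row.
fixed = df.T.reset_index().T

# The new first row is labelled "index", so renumber the rows as well.
fixed = fixed.reset_index(drop=True)
fixed.columns = range(fixed.shape[1])
print(fixed)
```

All columns end up with object dtype, because each one now mixes the old label with the data values.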

Related

DataFrame insert row

I have some trouble with my Python work.
My steps are:
1) add the list to the original DataFrame
2) delete the column whose value in the list is the minimum
My list is called 'each_c' and my original DataFrame is called 'df_col'.
I want it to become like this:
I hope someone can help me, thanks!
This is clearly described in the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
df_col.drop(columns=[3])
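Convert each_c to a Series, append it to the DataFrame as a new row, then get the label of the minimal value with Series.idxmin and pass it to drop. This removes only the first column that holds the minimum:
s = pd.Series(each_c)
# DataFrame.append was removed in pandas 2.0; concat the Series as a one-row frame instead
df = pd.concat([df_col, s.to_frame().T], ignore_index=True).drop(s.idxmin(), axis=1)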
If you need to remove every column that shares the minimal value:
each_c = [-0.025,0.008,-0.308,-0.308]
s = pd.Series(each_c)
df_col = pd.DataFrame(np.random.random((10,4)))
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat([df_col, s.to_frame().T], ignore_index=True)
df = df.loc[:, s.ne(s.min())]
print (df)
0 1
0 0.602312 0.641220
1 0.586233 0.634599
2 0.294047 0.339367
3 0.246470 0.546825
4 0.093003 0.375238
5 0.765421 0.605539
6 0.962440 0.990816
7 0.810420 0.943681
8 0.307483 0.170656
9 0.851870 0.460508
10 -0.025000 0.008000
EDIT: If the solution raises this error:
IndexError: Boolean index has wrong length:
it means the columns do not have the default range names 0, 1, 2, 3. A possible solution is to set the index values of the Series with rename:
each_c = [-0.025,0.008,-0.308,-0.308]
df_col = pd.DataFrame(np.random.random((10,4)), columns=list('abcd'))
# use df_col.columns here: df does not exist yet at this point
s = pd.Series(each_c).rename(dict(enumerate(df_col.columns)))
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat([df_col, s.to_frame().T], ignore_index=True)
df = df.loc[:, s.ne(s.min())]
print (df)
a b
0 0.321498 0.327755
1 0.514713 0.575802
2 0.866681 0.301447
3 0.068989 0.140084
4 0.069780 0.979451
5 0.629282 0.606209
6 0.032888 0.204491
7 0.248555 0.338516
8 0.270608 0.731319
9 0.732802 0.911920
10 -0.025000 0.008000
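An alternative that avoids label alignment altogether is a positional mask. The following is only a sketch under the same assumptions as above (each_c has one value per column of df_col); the names new_row and keep are made up for illustration.

```python
import numpy as np
import pandas as pd

each_c = [-0.025, 0.008, -0.308, -0.308]
rng = np.random.default_rng(0)
df_col = pd.DataFrame(rng.random((10, 4)), columns=list("abcd"))

# One-row frame that reuses df_col's column labels, so concat aligns by label.
new_row = pd.DataFrame([each_c], columns=df_col.columns)
df = pd.concat([df_col, new_row], ignore_index=True)

# Positional boolean mask: True for columns whose list value is not the minimum.
keep = np.array(each_c) != min(each_c)
df = df.loc[:, keep]
print(df.columns.tolist())
```

Because the mask is a plain NumPy array, it is applied by position and works regardless of what the columns are called.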

Data cleaning: Remove 0 value from my dataset having a header and index_col

I have the dataset shown below.
I would like to do three things.
Step 1: AA to CC is an index; however, I am happy to keep it in the dataset for future use.
Step 2: Count the 0 values in each row.
Step 3: If more than 20% of the values in a row are 0 (more than 2 in this case, because DD to MM is 10 columns), remove the row.
So I used a rather crude approach to achieve the three steps above.
df = pd.read_csv("dataset.csv", header=None)
df_bool = (df == "0")
print(df_bool.sum(axis=1))
Then I got the expected result shown below.
0 0
1 0
2 1
3 0
4 1
5 8
6 1
7 0
So I removed row 5 as shown below.
df2 = df.drop([5], axis=0)
print(df2)
This works well, even though it is not an elegant way to go.
However, if I import my dataset with header=0, this approach does not work at all.
df = pd.read_csv("dataset.csv", header=0)
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
Why does this happen?
Also, if I would like to write the code with a loop, count and drop functions, what would it look like?
You can just continue using boolean indexing.
First we calculate the number of columns and the number of zeroes per row:
n_columns = len(df.columns) # or df.shape[1]
zeroes = (df == "0").sum(axis=1)
We then select only the rows that have less than 20% zeroes.
proportion_zeroes = zeroes / n_columns
max_20 = proportion_zeroes < 0.20
df[max_20] # This will contain only rows that have less than 20 % zeroes
One liner:
df[((df == "0").sum(axis=1) / len(df.columns)) < 0.2]
It would have been great if you had posted how the dataframe looks in pandas rather than a picture of an Excel file. However, constructing a dummy df:
df = pd.DataFrame({'index1':['a','b','c'],'index2':['b','g','f'],'index3':['w','q','z']
,'Col1':[0,1,0],'Col2':[1,1,0],'Col3':[1,1,1],'Col4':[2,2,0]})
Step 1, assigning the index, can be done with the .set_index() method as shown below:
df.set_index(['index1','index2','index3'],inplace=True)
Instead of doing everything manually when it comes to filtering, you can use the kind of result you got from df_bool.sum(axis=1) directly to filter the dataframe, as shown below:
df.loc[(df==0).sum(axis=1) / (df.shape[1])>0.6]
index1 index2 index3 Col1 Col2 Col3 Col4
c f z 0 0 1 0
Using that, you can drop those rows. Assuming a 20% threshold, you would use:
df = df.loc[(df==0).sum(axis=1) / (df.shape[1])<0.2]
When it comes to the header issue, it's a bit difficult to answer without seeing what the file or dataframe looks like.
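One plausible explanation for the header issue, offered as an assumption since the file itself is not shown: with header=None the label line is read as data, so every column contains text and "0" is a string; with header=0 the data columns are parsed as integers and the string "0" never matches. The sketch below uses a small made-up CSV (csv_text and its values are hypothetical) to show the difference, and then filters on the share of zeros in the numeric columns only.

```python
import io
import pandas as pd

# Hypothetical stand-in for dataset.csv: three label columns, then numeric data.
csv_text = "AA,BB,CC,DD,EE,FF\nx,y,z,0,5,0\np,q,r,1,2,3\n"

raw = pd.read_csv(io.StringIO(csv_text), header=None)  # header line becomes a data row
parsed = pd.read_csv(io.StringIO(csv_text), header=0)  # header line becomes the labels

# header=None: every column also contains a label such as "DD", so values stay text.
zeros_raw = (raw == "0").sum(axis=1).tolist()

# header=0: DD..FF are parsed as integers, so the string "0" never matches.
zeros_parsed_str = (parsed == "0").sum(axis=1).tolist()

# Compare the numeric columns with the number 0 instead, and use the share of zeros.
numeric = parsed.select_dtypes("number")
zero_share = (numeric == 0).mean(axis=1)
kept = parsed[zero_share <= 0.2]

print(zeros_raw, zeros_parsed_str, kept.index.tolist())
```

Restricting the count to the numeric columns also means the 20% threshold is measured against the data columns only, which is what the question describes.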

How to delete all columns in DataFrame except certain ones?

Let's say I have a DataFrame that looks like this:
a b c d e f g
1 2 3 4 5 6 7
4 3 7 1 6 9 4
8 9 0 2 4 2 1
How would I go about deleting every column besides a and b?
This would result in:
a b
1 2
4 3
8 9
I would like a way to delete these using a simple line of code that says, delete all columns besides a and b, because let's say hypothetically I have 1000 columns of data.
Thank you.
In [48]: df.drop(df.columns.difference(['a','b']), axis=1, inplace=True)
In [49]: df
Out[49]:
a b
0 1 2
1 4 3
2 8 9
or:
In [55]: df = df.loc[:, df.columns.intersection(['a','b'])]
In [56]: df
Out[56]:
a b
0 1 2
1 4 3
2 8 9
PS: please be aware that the most idiomatic pandas way to do that was already proposed by @Wen:
df = df[['a','b']]
or
df = df.loc[:, ['a','b']]
Another option to add to the mix. I prefer this approach for readability.
df = df.filter(['a', 'b'])
Where the first positional argument is items=[]
Bonus
You can also use a like argument or regex to filter.
Helpful if you have a set of columns like ['a_1','a_2','b_1','b_2']
You can do
df = df.filter(like='b_')
and end up with ['b_1','b_2']
Pandas documentation for filter.
There are multiple solutions:
df = df[['a','b']] #1
df = df[list('ab')] #2
df = df.loc[:,df.columns.isin(['a','b'])] #3
df = pd.DataFrame(data=df.eval('a,b').T,columns=['a','b']) #4 PS:I do not recommend this method , but still a way to achieve this
Hey, what you are looking for is:
df = df[["a","b"]]
You will receive a dataframe which only contains the columns a and b.
If you want to keep more columns than you are dropping, put a "~" before the .isin call to select every column except the ones listed (here, every column except a and b):
df = df.loc[:, ~df.columns.isin(['a','b'])]
If you have more than two columns that you want to keep, let's say 20 or 30, you can use a list as well. Make sure that you also specify the axis value.
keep_list = ["a","b"]
# drops every column that is NOT in keep_list
df = df.drop(df.columns.difference(keep_list), axis=1)
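The approaches above differ in how they treat a requested column that does not exist, which can matter with 1000 columns. A small sketch, using a made-up frame and a deliberately missing label "z":

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 4, 8], "b": [2, 3, 9], "c": [3, 7, 0]})
wanted = ["a", "b", "z"]  # "z" is deliberately absent from df

# filter(items=...) keeps the labels that exist and skips the rest.
lenient = df.filter(items=wanted)

# Plain label selection is strict: a missing label raises KeyError.
try:
    df[wanted]
    strict_failed = False
except KeyError:
    strict_failed = True

print(lenient.columns.tolist(), strict_failed)
```

Which behaviour is preferable depends on whether a missing column should be treated as an error or silently ignored.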

How to write a pivot_table to a txt file in Python

I have a pivot_table as follows:
There are spaces in the table.
What I want to write to the txt file is:
How can I do this?
chaoshidishi=pd.pivot_table(clsc,index="故障发生地市",values="工单号",aggfunc=len)
chaoshidishi=chaoshidishi.to_frame()
f=open(r'E:\gaotie\dishi.txt','w')
for row in chaoshidishi:
    f.write(row[0]+row[1])
f.close()
Following up on @shanmuga's comment, you should be able to use to_csv() without first using to_frame().
First, here's some sample data that seems to reflect your setup:
import pandas as pd
group = ['a','a','b','c','c']
value = [1,2,3,4,5]
df = pd.DataFrame({'group':group,'value':value})
print(df)
group value
0 a 1
1 a 2
2 b 3
3 c 4
4 c 5
Now apply pivot_table():
df.pivot_table(columns='group', values='value', aggfunc=len)
group
a 2
b 1
c 2
Name: value, dtype: int64
You can save to file directly from this output. If you don't want to preserve index and column names, use header=None on load:
(df.pivot_table(columns='group', values='value', aggfunc=len)
.to_csv('foo.txt'))
newdf = pd.read_csv('foo.txt', header=None)
print(newdf)
0 1
0 a 2
1 b 1
2 c 2
To preserve column and index names, use the header argument on save, and the index_col argument on load:
(df.pivot_table(columns='group', values='value', aggfunc=len)
.to_csv('foo.txt', header='group'))
newdf = pd.read_csv('foo.txt', index_col='group')
print(newdf)
value
group
a 2
b 1
c 2
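For a plain text file with one "group, count" pair per line, a variation along the following lines may be closer to what the question asks for. This is a sketch under assumptions: it uses index="group" and aggfunc="count" rather than the exact call shown above, a tab as separator, and a temporary path invented for the example.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b", "c", "c"], "value": [1, 2, 3, 4, 5]})

# index="group" gives one row per group; "count" counts the rows in each group.
counts = df.pivot_table(index="group", values="value", aggfunc="count")

# Hypothetical output location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "counts.txt")

# Tab-separated, no header line: each line is "<group>\t<count>".
counts.to_csv(path, sep="\t", header=False)

with open(path, encoding="utf-8") as fh:
    text = fh.read()
print(text)
```

The sep argument can be changed to a space or comma if the target format requires it.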

Saving and Loading of dataframe to csv results in Unnamed columns

The problem is in the title. Example:
from pandas import DataFrame, read_csv
x=[('a','a','c') for i in range(5)]
df = DataFrame(x,columns=['col1','col2','col3'])
df.to_csv('test.csv')
df1 = read_csv('test.csv')
Unnamed: 0 col1 col2 col3
0 0 a a c
1 1 a a c
2 2 a a c
3 3 a a c
4 4 a a c
The reason seems to be that when a dataframe is saved, the index column is written as well, with no name in the header. Then, when you load the csv again, the index column is loaded as an unnamed column. Is this a bug? How can I avoid writing the index to the csv, or drop unnamed columns when reading?
You can remove row labels via the index and index_label parameters of to_csv.
These are not symmetric as there are ambiguities in the csv format because of the positioning. You need to specify an index_col on read-back
In [1]: x=[('a','a','c') for i in range(5)]
In [2]: df = DataFrame(x,columns=['col1','col2','col3'])
In [3]: df.to_csv('test.csv')
In [4]: !cat test.csv
,col1,col2,col3
0,a,a,c
1,a,a,c
2,a,a,c
3,a,a,c
4,a,a,c
In [5]: pd.read_csv('test.csv',index_col=0)
Out[5]:
col1 col2 col3
0 a a c
1 a a c
2 a a c
3 a a c
4 a a c
This looks very similar to the above, so is 'foo' a column or an index?
In [6]: df.index.name = 'foo'
In [7]: df.to_csv('test.csv')
In [8]: !cat test.csv
foo,col1,col2,col3
0,a,a,c
1,a,a,c
2,a,a,c
3,a,a,c
4,a,a,c
You can specify explicitly which columns you want to write using the columns parameter (named cols in very old pandas versions).
This is how to use index_label:
df.to_csv('test.csv', index_label=False)
But for me, when I tried to submit to Kaggle, it returned the error "ERROR: Record 1 had 3 columns but expected 2", so I solved it using this code.
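A common way to keep the row labels out of the file entirely is index=False. Whether that is the exact fix meant above is an assumption; the sketch below only shows that the round trip then produces no "Unnamed: 0" column. An in-memory buffer stands in for test.csv.

```python
import io
import pandas as pd

df = pd.DataFrame([("a", "a", "c")] * 5, columns=["col1", "col2", "col3"])

buffer = io.StringIO()          # in-memory stand-in for test.csv
df.to_csv(buffer, index=False)  # do not write the row labels at all
buffer.seek(0)

df1 = pd.read_csv(buffer)
print(df1.columns.tolist())
```

With index=False the file has exactly as many fields per record as the frame has columns, which is what a strict consumer of the CSV would expect.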
