I have 3 files representing the same dataset, split in three, and I need to concatenate them:
import pandas
df1 = pandas.read_csv('path1')
df2 = pandas.read_csv('path2')
df3 = pandas.read_csv('path3')
df = pandas.concat([df1,df2,df3])
But this will keep the headers in the middle of the dataset; I need to remove the headers (column names) coming from the 2nd and 3rd files. How do I do that?
I think you need numpy.concatenate with the DataFrame constructor:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.concatenate([df1.values, df2.values, df3.values]), columns=df1.columns)
Another solution is to replace the column names in df2 and df3:
df2.columns = df1.columns
df3.columns = df1.columns
df = pd.concat([df1,df2,df3], ignore_index=True)
Samples:
np.random.seed(100)
df1 = pd.DataFrame(np.random.randint(10, size=(2,3)), columns=list('ABF'))
print (df1)
A B F
0 8 8 3
1 7 7 0
df2 = pd.DataFrame(np.random.randint(10, size=(1,3)), columns=list('ERT'))
print (df2)
E R T
0 4 2 5
df3 = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=list('HTR'))
print (df3)
H T R
0 2 2 2
1 1 0 8
2 4 0 9
print (np.concatenate([df1.values, df2.values, df3.values]))
[[8 8 3]
[7 7 0]
[4 2 5]
[2 2 2]
[1 0 8]
[4 0 9]]
df = pd.DataFrame(np.concatenate([df1.values, df2.values, df3.values]), columns=df1.columns)
print (df)
A B F
0 8 8 3
1 7 7 0
2 4 2 5
3 2 2 2
4 1 0 8
5 4 0 9
df = pd.concat([df1,df2,df3], ignore_index=True)
print (df)
A B F
0 8 8 3
1 7 7 0
2 4 2 5
3 2 2 2
4 1 0 8
5 4 0 9
You have to use the skiprows argument of read_csv for the second and third files. Note that with the default header='infer', skipping the first line would make read_csv treat the first data row as the header, so pass header=None and reuse the first file's column names:
import pandas
df1 = pandas.read_csv('path1')
df2 = pandas.read_csv('path2', skiprows=1, header=None, names=df1.columns)
df3 = pandas.read_csv('path3', skiprows=1, header=None, names=df1.columns)
df = pandas.concat([df1, df2, df3], ignore_index=True)
Been working on this recently myself; here's the most compact/elegant thing I came up with. Note that iloc[0:] selects every row, so to actually drop the stray header rows (assuming the files were read with header=None, so each later frame carries its header line as its first data row) you want iloc[1:] on every frame after the first:
import pandas as pd

frame_list = [df1, df2, df3]
# keep the first frame whole, drop the leading header row from the rest
frame_mod = [frame_list[0]] + [frame.iloc[1:] for frame in frame_list[1:]]
frame_frame = pd.concat(frame_mod, ignore_index=True)
Use:
df = pd.merge(df1, df2, how='outer')
This merges rows that appear in either or both df1 and df2 (a union). Note that, unlike concat, a row present in both inputs appears only once in the result.
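A minimal sketch of that union behaviour, using two made-up frames (the data here is illustrative, not from the question):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [2, 5], 'B': [4, 6]})

# outer merge on all shared columns: the row (2, 4) exists in both
# frames but shows up only once in the result
print (pd.merge(df1, df2, how='outer'))
   A  B
0  1  3
1  2  4
2  5  6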
Related
I have a pandas DataFrame with about 200 columns. Roughly, I want to do this
for col in df.columns:
    if col begins with a number:
        df.drop(col)
I'm not sure what the best practices are when it comes to handling pandas DataFrames; how should I handle this? Will my pseudocode work, or is it not recommended to modify a pandas DataFrame in a for loop?
I think the simplest way is to select all columns which do not start with a number, using filter with a regex - ^ matches the start of the string and \D matches a non-digit:
df1 = df.filter(regex=r'^\D')
Similar alternative:
df1 = df.loc[:, df.columns.str.contains(r'^\D')]
Or invert the condition and match digits instead:
df1 = df.loc[:, ~df.columns.str.contains(r'^\d')]
df1 = df.loc[:, ~df.columns.str[0].str.isnumeric()]
If you want to use your pseudocode:
for col in df.columns:
    if col[0].isnumeric():
        df = df.drop(col, axis=1)
Sample:
df = pd.DataFrame({'2A':list('abcdef'),
'1B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D3':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
1B 2A C D3 E F
0 4 a 7 1 5 a
1 5 b 8 3 3 a
2 4 c 9 5 6 a
3 5 d 4 7 9 b
4 5 e 2 1 2 b
5 4 f 3 0 4 b
df1 = df.filter(regex=r'^\D')
print (df1)
C D3 E F
0 7 1 5 a
1 8 3 3 a
2 9 5 6 a
3 4 7 9 b
4 2 1 2 b
5 3 0 4 b
An alternative can be this:
columns = [x for x in df.columns if not x[0].isdigit()]
df = df[columns]
How can I join values in columns with the same name in MultiIndex pandas DataFrame?
data = [['1','1','2','3','4'],['2','5','6','7','8']]
df = pd.DataFrame(data, columns=['id','A','B','A','B'])
df = df.set_index('id')
df.columns = pd.MultiIndex.from_tuples([('result','A'),('result','B'),('student','A'),('student','B')])
df
result student
A B A B
id
1 1 2 3 4
2 5 6 7 8
Desired results:
A B
id
1 "1 3" "2 4"
2 "5 7" "6 8"
I am not completely sure what you are asking. If you have two separate dataframes then you should be able to just use pd.concat.
pd.concat([df1, df2], axis=1)
If you have one dataframe then just drop the top level of the index.
df.columns = df.columns.droplevel(0)
New answer:
To join values by the second level of the MultiIndex in columns, use groupby with agg:
# select the columns defined in the list
df = df[['result','student']]
df1 = df.astype(str).groupby(level=1, axis=1).agg(' '.join)
print (df1)
A B
id
1 1 3 2 4
2 5 7 6 8
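A side note: groupby(..., axis=1) is deprecated in pandas 2.x. Assuming a recent version, an equivalent sketch (reusing df from above) transposes, groups the rows by the second index level, and transposes back:
# same result as above, without axis=1
df1 = df.astype(str).T.groupby(level=1).agg(' '.join).T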
Old answer:
You can use sort_index to sort the columns and then droplevel to remove the first level of the MultiIndex, but you get duplicate column names:
print (df)
result student col
A B A B A B
id
1 1 2 3 4 6 7
2 5 6 7 8 2 1
# select the columns defined in the list
df = df[['result','student']]
print (df)
result student
A B A B
id
1 1 2 3 4
2 5 6 7 8
df = df.sort_index(axis=1, level=1)
df.columns = df.columns.droplevel(0)
print (df)
A A B B
id
1 1 3 2 4
2 5 7 6 8
So, better: unique column names can be created by map with join:
df = df.sort_index(axis=1, level=1)
df.columns = df.columns.map('_'.join)
print (df)
result_A student_A result_B student_B
id
1 1 3 2 4
2 5 7 6 8
df = pd.concat([df['result'],df['student']], axis=1).sort_index(axis=1)
print (df)
A A B B
id
1 1 3 2 4
2 5 7 6 8
I have two dataframes. DF and SubDF. SubDF is a subset of DF. I want to extract the rows in DF that are NOT in SubDF.
I tried the following:
DF2 = DF[~DF.isin(SubDF)]
The number of rows is correct and most rows are correct, i.e. the number of rows in SubDF plus the number of rows in DF2 equals the number of rows in DF, but I get rows with NaN values that do not exist in the original DF.
Not sure what I'm doing wrong.
Note: the original DF does not have any NaN values; to double-check, I ran DF.dropna() beforehand and the result still produced NaNs.
You need merge with an outer join and boolean indexing, because DataFrame.isin needs both the values and the index to match:
DF = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (DF)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
SubDF = pd.DataFrame({'A':[3],
'B':[6],
'C':[9],
'D':[5],
'E':[6],
'F':[3]})
print (SubDF)
A B C D E F
0 3 6 9 5 6 3
# returns no matches here (isin aligns on both index and values), so nothing is removed
DF2 = DF[~DF.isin(SubDF)]
print (DF2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
DF2 = pd.merge(DF, SubDF, how='outer', indicator=True)
DF2 = DF2[DF2['_merge'] == 'left_only'].drop('_merge', axis=1)
print (DF2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
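A variant sketch that keeps DF's original index (the outer merge above resets it): build a boolean mask from a left merge with the indicator and apply it positionally, reusing DF and SubDF from above:
# '_merge' is 'left_only' for rows of DF with no counterpart in SubDF
mask = DF.merge(SubDF, how='left', indicator=True)['_merge'].eq('left_only')
DF2 = DF[mask.to_numpy()]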
Another way, borrowing the setup from @jezrael:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
sub = pd.DataFrame({'A':[3],
'B':[6],
'C':[9],
'D':[5],
'E':[6],
'F':[3]})
extract_idx = list(set(df.index) - set(sub.index))
df_extract = df.loc[extract_idx]
The rows may not be sorted in the original df order. If matching order is required:
extract_idx = list(set(df.index) - set(sub.index))
idx_dict = dict(enumerate(df.index))
order_dict = dict(zip(idx_dict.values(), idx_dict.keys()))
df_extract = df.loc[sorted(extract_idx, key=order_dict.get)]
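If sub really keeps the index labels it had in df, as in the setup above, a shorter route that also preserves df's original row order is to drop those labels directly:
# drop the subset's index labels; the remaining rows keep their order
df_extract = df.drop(sub.index)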
I have written code to append several dummy DataFrames into one. After appending, the expected DataFrame.shape would be (9, 3), but my code produces an unexpected output of (6, 3). How can I rectify the error in my code?
import pandas as pd
a = [[1,2,4],[1,3,4],[2,3,4]]
b = [[1,1,1],[1,6,4],[2,9,4]]
c = [[1,3,4],[1,1,4],[2,0,4]]
d = [[1,1,4],[1,3,4],[2,0,4]]
df1 = pd.DataFrame(a,columns=["a","b","c"])
df2 = pd.DataFrame(b,columns=["a","b","c"])
df3 = pd.DataFrame(c,columns=["a","b","c"])
for df in (df1, df2, df3):
    df = df.append(df, ignore_index=True)

print(df)
I don't want to use pd.concat because in that case I would have to store all the data frames in memory, and my real dataset contains hundreds of data frames with huge shapes. I just want code that opens one CSV file at a time inside the loop and updates the final DataFrame as the loop progresses.
Thanks.
Firstly, use concat to concatenate a bunch of dfs; it's quicker:
In [308]:
df = pd.concat([df1,df2,df3], ignore_index=True)
df
Out[308]:
a b c
0 1 2 4
1 1 3 4
2 2 3 4
3 1 1 1
4 1 6 4
5 2 9 4
6 1 3 4
7 1 1 4
8 2 0 4
Secondly, you're reusing the name df for your loop variable, which is why it gets overwritten on each iteration; if you did this, it would work:
In [307]:
a = [[1,2,4],[1,3,4],[2,3,4]]
b = [[1,1,1],[1,6,4],[2,9,4]]
c = [[1,3,4],[1,1,4],[2,0,4]]
d = [[1,1,4],[1,3,4],[2,0,4]]
df1 = pd.DataFrame(a,columns=["a","b","c"])
df2 = pd.DataFrame(b,columns=["a","b","c"])
df3 = pd.DataFrame(c,columns=["a","b","c"])
df = pd.DataFrame()
for d in (df1, df2, df3):
    df = df.append(d, ignore_index=True)
df
Out[307]:
a b c
0 1 2 4
1 1 3 4
2 2 3 4
3 1 1 1
4 1 6 4
5 2 9 4
6 1 3 4
7 1 1 4
8 2 0 4
Here I changed the loop variable to d and declared an empty df outside the loop:
df = pd.DataFrame()
for d in (df1, df2, df3):
    df = df.append(d, ignore_index=True)
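One caveat worth adding: DataFrame.append was removed in pandas 2.0, and growing a frame inside a loop re-copies all accumulated rows on every iteration. A sketch of the now-recommended pattern, which still reads only one CSV at a time (the file names below are placeholders):
import pandas as pd

pieces = []
for path in ('part1.csv', 'part2.csv', 'part3.csv'):  # hypothetical paths
    pieces.append(pd.read_csv(path))  # each file is read exactly once
df = pd.concat(pieces, ignore_index=True)  # one concatenation at the end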
I am using Python Pandas for the following. I have three dataframes, df1, df2 and df3. Each has the same dimensions, index and column labels. I would like to create a fourth dataframe that takes elements from df1 or df2 depending on the values in df3:
df1 = pd.DataFrame(np.random.randn(4, 2), index=list('0123'), columns=['A', 'B'])
df1
Out[67]:
A B
0 1.335314 1.888983
1 1.000579 -0.300271
2 -0.280658 0.448829
3 0.977791 0.804459
df2 = pd.DataFrame(np.random.randn(4, 2), index=list('0123'), columns=['A', 'B'])
df2
Out[68]:
A B
0 0.689721 0.871065
1 0.699274 -1.061822
2 0.634909 1.044284
3 0.166307 -0.699048
df3 = pd.DataFrame({'A': [1, 0, 0, 1], 'B': [1, 0, 1, 0]})
df3
Out[69]:
A B
0 1 1
1 0 0
2 0 1
3 1 0
The new dataframe, df4, has the same index and column labels and takes an element from df1 if the corresponding value in df3 is 1. It takes an element from df2 if the corresponding value in df3 is a 0.
I need a solution that uses generic references (e.g. ix or iloc) rather than actual column labels and index values because my dataset has fifty columns and four hundred rows.
As your DataFrames happen to be numeric, and the selector matrix happens to be of indicator variables, you can do the following:
>>> pd.DataFrame(
df1.as_matrix() * df3.as_matrix() + df2.as_matrix() * (1 - df3.as_matrix()),
index=df1.index,
columns=df1.columns)
I tried it myself and it works. Strangely enough, @Yakym Pirozhenko's answer - which I think is superior - doesn't work for me.
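Also worth noting: DataFrame.as_matrix was removed in pandas 1.0. A sketch of the same arithmetic with the modern API, reusing df1, df2 and df3 from above:
import pandas as pd

# to_numpy() replaced as_matrix(); the selection arithmetic is unchanged
m1, m2, m3 = df1.to_numpy(), df2.to_numpy(), df3.to_numpy()
df4 = pd.DataFrame(m1 * m3 + m2 * (1 - m3), index=df1.index, columns=df1.columns)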
df4 = df1.where(df3.astype(bool), df2) should do it.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randint(10, size = (4,2)))
df2 = pd.DataFrame(np.random.randint(10, size = (4,2)))
df3 = pd.DataFrame(np.random.randint(2, size = (4,2)))
df4 = df1.where(df3.astype(bool), df2)
print(df1, '\n')
print(df2, '\n')
print(df3, '\n')
print(df4, '\n')
Output:
0 1
0 0 3
1 8 8
2 7 4
3 1 2
0 1
0 7 9
1 4 4
2 0 5
3 7 2
0 1
0 0 0
1 1 0
2 1 1
3 1 0
0 1
0 7 9
1 8 4
2 7 4
3 1 2
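For comparison, an equivalent sketch with numpy.where, which picks elementwise from df1 where df3 is truthy and from df2 elsewhere:
import numpy as np
import pandas as pd

df4 = pd.DataFrame(np.where(df3, df1, df2), index=df1.index, columns=df1.columns)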