I have 2 tables, say:
table1 = 101 1 2 3
201 4 5 6
301 7 8 9
table2 = 10 11 101 12
13 14 201 15
16 17 301 18
It is clear that table1 column 1 and table2 column 3 are the columns they have in common. I want to join these 2 tables using pandas, but the problem is that my tables do not have a header. How can I do this?
EDIT
I am using pd.read_csv to read the tables; they are text files.
outputtable = 101 1 2 3 10 11 12
201 4 5 6 13 14 15
301 7 8 9 16 17 18
and I would like to export the outputtable as a text file.
I'd set the index of each table to the ordinal column you want to merge on, then merge on the index; afterwards, rename the index so you can reset it without a name clash:
In [121]:
import io
import pandas as pd
# read in data, you can ignore the io.StringIO bit and replace with your paths
t="""101 1 2 3
201 4 5 6
301 7 8 9"""
table1 = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None)
t1="""10 11 101 12
13 14 201 15
16 17 301 18"""
table2 = pd.read_csv(io.StringIO(t1), sep=r'\s+', header=None)
# merge the tables after setting index
merged = table1.set_index(0).merge(table2.set_index(2), left_index=True, right_index=True)
# rename the index so reset_index doesn't complain that column 0 already exists
merged.index.name = 'index'
merged = merged.reset_index()
merged
Out[121]:
index 1_x 2 3_x 0 1_y 3_y
0 101 1 2 3 10 11 12
1 201 4 5 6 13 14 15
2 301 7 8 9 16 17 18
You can now export the df as desired, passing header=False:
In [124]:
merged.to_csv(header=False, index=False)
Out[124]:
'101,1,2,3,10,11,12\n201,4,5,6,13,14,15\n301,7,8,9,16,17,18\n'
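To get a space-separated text file like the desired output, to_csv also accepts a file path and a sep (the path 'output.txt' below is just a placeholder); putting the whole thing together:

```python
import io
import pandas as pd

t = """101 1 2 3
201 4 5 6
301 7 8 9"""
table1 = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None)

t1 = """10 11 101 12
13 14 201 15
16 17 301 18"""
table2 = pd.read_csv(io.StringIO(t1), sep=r'\s+', header=None)

# align table1 column 0 with table2 column 2 via the index, as above
merged = table1.set_index(0).merge(table2.set_index(2),
                                   left_index=True, right_index=True)
merged.index.name = 'index'
merged = merged.reset_index()

# space-separated text file, no header, no index ('output.txt' is a placeholder)
merged.to_csv('output.txt', sep=' ', header=False, index=False)
```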
Another thing you can easily do (I assume df1 and df2 are your two tables):
l1 = [''.join(df1.astype(str)[c].tolist()) for c in df1]
l2 = [''.join(df2.astype(str)[c].tolist()) for c in df2]
indexes = sorted(l1.index(i) for i in set(l1) - set(l2))
In [194]: pd.concat([df2, df1.iloc[:, indexes]], axis=1)
Out[194]:
0 1 2 3 1 2 3
0 10 11 101 12 1 2 3
1 13 14 201 15 4 5 6
2 16 17 301 18 7 8 9
Related
I have 2 dataframes, let's say:
df1 =>
colA colB colC
0 1 2
3 4 5
6 7 8
df2 (same number of rows and columns) =>
colD colE colF
10 11 12
13 14 15
16 17 18
I want to compare columns from both dataframes, for example:
df1['colB'] < df2['colF']
Currently I am getting:
ValueError: Can only compare identically-labeled Series objects
EDIT: the comparison happens while doing:
df1.loc[df1['colB'] < df2['colF'], 'set_something'] = 1
Any help on how I can implement this? Thanks
You get the error because your Series are not aligned (and they might have duplicated indices).
If you just care about position, not index labels, use the underlying NumPy array:
df1['colB'] < df2['colF'].to_numpy()
If you want to assign the result back to a column, make sure to convert the column from the other DataFrame to an array:
df1['new'] = df1['colB'] < df2['colF'].to_numpy()
Or
df2['new'] = df1['colB'].to_numpy() < df2['colF']
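A minimal sketch of the failure and the fix, assuming the two frames simply carry different index labels:

```python
import pandas as pd

df1 = pd.DataFrame({'colB': [1, 4, 7]}, index=[0, 1, 2])
df2 = pd.DataFrame({'colF': [12, 15, 18]}, index=[10, 11, 12])

# df1['colB'] < df2['colF'] raises "Can only compare identically-labeled
# Series objects" here, because the index labels don't match.
# Comparing against the underlying array sidesteps alignment entirely:
mask = df1['colB'] < df2['colF'].to_numpy()
df1.loc[mask, 'set_something'] = 1
```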
This is a non-equi join; you should get more performance with some form of binary search. conditional_join from pyjanitor does that under the hood:
# pip install pyjanitor
import pandas as pd
import janitor
df1.conditional_join(df2, ('colB', 'colF', '<'))
colA colB colC colD colE colF
0 0 1 2 10 11 12
1 0 1 2 13 14 15
2 0 1 2 16 17 18
3 3 4 5 10 11 12
4 3 4 5 13 14 15
5 3 4 5 16 17 18
6 6 7 8 10 11 12
7 6 7 8 13 14 15
8 6 7 8 16 17 18
If the condition is based on an equality (df1.colB == df2.colF), then pd.merge should suffice and is efficient.
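If pyjanitor isn't available, a plain-pandas fallback (a sketch; it materialises the full cross product, so only reasonable for small frames) is a cross join followed by a filter:

```python
import pandas as pd

df1 = pd.DataFrame({'colA': [0, 3, 6], 'colB': [1, 4, 7], 'colC': [2, 5, 8]})
df2 = pd.DataFrame({'colD': [10, 13, 16], 'colE': [11, 14, 17], 'colF': [12, 15, 18]})

# pandas >= 1.2: how='cross' pairs every row of df1 with every row of df2,
# then keep only the pairs satisfying colB < colF
out = df1.merge(df2, how='cross').query('colB < colF').reset_index(drop=True)
```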
I have a data frame with 790 rows. I want to create a new data frame that excludes rows 300 to 400 and keeps the rest.
I tried:
df.loc[[:300, 400:]]
df.iloc[[:300, 400:]]
df_new = df.drop(labels=range([300:400]), axis=0)
None of these work. How can I achieve this goal?
Thanks in advance
Use range, or numpy.r_ to join index ranges:
df_new = df.drop(range(300, 400))
import numpy as np
df_new = df.iloc[np.r_[0:300, 400:len(df)]]
Sample:
df = pd.DataFrame({'a':range(20)})
# print (df)
df1 = df.drop(labels=range(7,15))
print (df1)
a
0 0
1 1
2 2
3 3
4 4
5 5
6 6
15 15
16 16
17 17
18 18
19 19
df1 = df.iloc[np.r_[0:7, 15:len(df)]]
print (df1)
a
0 0
1 1
2 2
3 3
4 4
5 5
6 6
15 15
16 16
17 17
18 18
19 19
First select the index of the rows you want to drop, then create a new df:
i = df.iloc[300:400].index
new_df = df.drop(i)
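An equivalent route that avoids building an intermediate index object is a boolean mask (a sketch, assuming the frame still has its default RangeIndex, shown here at the small scale of the 20-row sample):

```python
import pandas as pd

df = pd.DataFrame({'a': range(20)})

# keep every row whose (default RangeIndex) label falls outside 7..14
df_new = df[~df.index.isin(range(7, 15))]
```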
Basic question - I am trying to concatenate two DataFrames, with the resulting DataFrame preserving the index in order of the original two. For example:
df = pd.DataFrame({'Houses':[10,20,30,40,50], 'Cities':[3,4,7,6,1]}, index = [1,2,4,6,8])
df2 = pd.DataFrame({'Houses':[15,25,35,45,55], 'Cities':[1,8,11,14,4]}, index = [0,3,5,7,9])
Using pd.concat([df, df2]) simply appends df2 to the end of df. I am trying instead to concatenate them so that the resulting index comes out in order (0 through 9).
Use concat with the sort parameter to avoid the warning, then DataFrame.sort_index:
df = pd.concat([df, df2], sort=False).sort_index()
print(df)
Cities Houses
0 1 15
1 3 10
2 4 20
3 8 25
4 7 30
5 11 35
6 6 40
7 14 45
8 1 50
9 4 55
Try using:
print(df.T.join(df2.T).T.sort_index())
Output:
Cities Houses
0 1 15
1 3 10
2 4 20
3 8 25
4 7 30
5 11 35
6 6 40
7 14 45
8 1 50
9 4 55
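Both answers produce the same frame; a quick sketch checking that concat + sort_index and the transpose/join trick agree on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Houses': [10, 20, 30, 40, 50], 'Cities': [3, 4, 7, 6, 1]},
                  index=[1, 2, 4, 6, 8])
df2 = pd.DataFrame({'Houses': [15, 25, 35, 45, 55], 'Cities': [1, 8, 11, 14, 4]},
                   index=[0, 3, 5, 7, 9])

# route 1: stack the rows, then sort by index label
a = pd.concat([df, df2], sort=False).sort_index()
# route 2: transpose, join on the (column-name) index, transpose back, sort
b = df.T.join(df2.T).T.sort_index()
```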
I have been modifying an Excel document with Pandas. I only need to work with small sections at a time, so breaking each into a separate DataFrame, modifying it, and recombining it back into the whole seems like the best solution. Is this feasible? I've tried a couple of options with merge() and concat(), but they don't give me the results I am looking for.
As previously stated, when I tried using the merge() function to recombine them with the larger table, I just got a MemoryError, and when I tested it with smaller dataframes, rows weren't maintained.
Here's a small-scale example of what I am looking to do:
import pandas as pd
df1 = pd.DataFrame({'A':[1,2,3,5,6],'B':[3,10,11,13,324],'C':[64,'','' ,'','' ],'D':[32,45,67,80,100]})#example df
print(df1)
df2 = df1[['B','C']].copy()  # section taken (copy to avoid SettingWithCopyWarning)
df2.at[2,'B'] = 1  # modify area
print(df2)
df1 = df1.merge(df2)#merge dataframes
print(df1)
output:
A B C D
0 1 3 64 32
1 2 10 45
2 3 11 67
3 5 13 80
4 6 324 100
B C
0 3 64
1 10
2 1
3 13
4 324
A B C D
0 1 3 64 32
1 2 10 45
2 5 13 80
3 6 324 100
what I would like to see
A B C D
0 1 3 64 32
1 2 10 45
2 3 11 67
3 5 13 80
4 6 324 100
B C
0 3 64
1 10
2 1
3 13
4 324
A B C D
0 1 3 64 32
1 2 10 45
2 3 1 67
3 5 13 80
4 6 324 100
as I said before, in my actual code I just get a MemoryError if I try this, due to the size of the dataframe
No need for merging here; you can just re-assign the values from df2 back into df1:
...
df1.loc[df2.index, df2.columns] = df2  # write the changes back into the original dataframe
print(df1)
giving as expected:
A B C D
0 1 3 64 32
1 2 10 45
2 3 1 67
3 5 13 80
4 6 324 100
df1.update(df2) gives the same result (thanks to Quang Hoang for pointing this out).
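A self-contained sketch of the update route, using the example frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 5, 6], 'B': [3, 10, 11, 13, 324],
                    'C': [64, '', '', '', ''], 'D': [32, 45, 67, 80, 100]})
df2 = df1[['B', 'C']].copy()  # take a section
df2.at[2, 'B'] = 1            # modify it

# update aligns on index and columns and overwrites df1 in place
df1.update(df2)
```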
So I am filling dataframes from 2 different files. While those 2 files should have the same structure (the values should be different, though), the resulting dataframes look different. When printing them I get:
a b c d
0 70402.14 70370.602112 0.533332 98
1 31362.21 31085.682726 1.912552 301
... ... ... ... ...
753919 64527.16 64510.008206 0.255541 71
753920 58077.61 58030.943621 0.835758 152
a b c d
index
0 118535.32 118480.657338 0.280282 47
1 49536.10 49372.999416 0.429902 86
... ... ... ... ...
753970 52112.95 52104.717927 0.356051 116
753971 37044.40 36915.264944 0.597472 165
So in the second dataframe there is that "index" row that doesn't make any sense to me, and it causes trouble in my following code. I neither wrote the code that fills the dataframes from the files nor created those files, so I'd like to check whether such a row exists and, if so, remove it. Does anyone have an idea about this?
The second dataframe has an index level named "index".
You can remove the name with
df.index.name = None
For example,
In [126]: df = pd.DataFrame(np.arange(15).reshape(5,3))
In [128]: df.index.name = 'index'
In [129]: df
Out[129]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [130]: df.index.name = None
In [131]: df
Out[131]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
The dataframe might have picked up the name "index" if you used reset_index and set_index like this:
In [138]: df.reset_index()
Out[138]:
index 0 1 2
0 0 0 1 2
1 1 3 4 5
2 2 6 7 8
3 3 9 10 11
4 4 12 13 14
In [140]: df.reset_index().set_index('index')
Out[140]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
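If you prefer a method call (handy inside a chain) over assigning to df.index.name, rename_axis does the same thing; a small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3))
df.index.name = 'index'

# rename_axis(None) returns a frame with the index name cleared,
# equivalent to assigning df.index.name = None
df = df.rename_axis(None)
```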
The index is just the first column - it numbers the rows by default, but you can change it in a number of ways (e.g. by filling it with values from one of the columns).