Merge two data frames - python
I tried to merge two data frames by appending the first row of the second df to the first row of the first df. I also tried to concatenate them, but either way failed.
The format of the data is
1,3,N0128,Durchm.,5.0,0.1,5.0760000000000005,0.076,-----****--
2,0.000,,,,,,,
3,3,N0129,Position,62.2,0.376,62.238,0.136,***---
4,76.1,-36.000,0.300,-36.057,,,,
5,2,N0130,Durchm.,5.0,0.1,5.067,0.067,-----***---
6,0.000,,,,,,,
The expected format of the output should be
1,3,N0128,Durchm.,5.0,0.1,5.0760000000000005,0.076,-----****--,0.000,,,,,,,
2,3,N0129,Position,62.2,0.376,62.238,0.136,***---**,76.1,-36.000,0.300,-36.057,,,,
3,N0130,Durchm.,5.0,0.1,5.067,0.067,-----***---,0.000,,,,,,,
I already split the dataframe above into two frames. The first one contains only the odd indexes and the second one only the even ones.
My problem now is to merge/concatenate the two frames by appending the first row of the second df to the first row of the first df. I already tried some merging/concatenating methods, but all of them failed. The print calls are not necessary; I only use them to get a quick overview in the console.
The code I felt most comfortable with is:
os.chdir(output)
csv_files = os.listdir('.')
for csv_file in csv_files:
    if csv_file.endswith(".asc.csv"):
        df = pd.read_csv(csv_file)
        keep_col = ['Messpunkt', 'Zeichnungspunkt', 'Eigenschaft', 'Position', 'Sollmass', 'Toleranz', 'Abweichung', 'Lage']
        new_df = df[keep_col]
        new_df = new_df[~new_df['Messpunkt'].isin(['**Teil'])]
        new_df = new_df[~new_df['Messpunkt'].isin(['**KS-Oben'])]
        new_df = new_df[~new_df['Messpunkt'].isin(['**KS-Unten'])]
        new_df = new_df[~new_df['Messpunkt'].isin(['**N'])]
        print(new_df)
        new_df.to_csv(output + csv_file)
        df1 = new_df[new_df.index % 2 == 1]
        df2 = new_df[new_df.index % 2 == 0]
        df1.reset_index()
        df2.reset_index()
        print(df1)
        print(df2)
        merge_df = pd.concat([df1, df2], axis=1)
        print(merge_df)
        merge_df.to_csv(output + csv_file)
I would highly appreciate some help.
With this code, the output is:
1,3,N0128,Durchm.,5.0,0.1,5.0760000000000005,0.076,-----****--,,,,,,,,
2,,,,,,,,,0.000,,,,,,,
3,3,N0129,Position,62.2,0.376,62.238,0.136,***---,,,,,,,,
4,,,,,,,,,76.1,-36.000,0.300,-36.057,,,,
5,2,N0130,Durchm.,5.0,0.1,5.067,0.067,-----***---,,,,,,,,
6,,,,,,,,,0.000,,,,,,,
I get the expected result when I use reset_index() to give both DataFrames the same index.
You may also need drop=True so the old index is not added as a new column:
pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
Minimal working example.
I use io only to simulate a file in memory.
text = '''1,3,N0128,Durchm.,5.0,0.1,5.0760000000000005,0.076,-----****--
2,0.000,,,,,,,
3,3,N0129,Position,62.2,0.376,62.238,0.136,***---
4,76.1,-36.000,0.300,-36.057,,,,
5,2,N0130,Durchm.,5.0,0.1,5.067,0.067,-----***---
6,0.000,,,,,,,'''
import pandas as pd
import io
pd.options.display.max_columns = 20 # to display all columns
df = pd.read_csv(io.StringIO(text), header=None, index_col=0)
#print(df)
df1 = df[df.index % 2 == 1] # .reset_index(drop=True)
df2 = df[df.index % 2 == 0] # .reset_index(drop=True)
#print(df1)
#print(df2)
merge_df = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
print(merge_df)
Result:
1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
0 3.0 N0128 Durchm. 5.0 0.100 5.076 0.076 -----****-- 0.0 NaN NaN NaN NaN NaN NaN NaN
1 3.0 N0129 Position 62.2 0.376 62.238 0.136 ***--- 76.1 -36.000 0.300 -36.057 NaN NaN NaN NaN
2 2.0 N0130 Durchm. 5.0 0.100 5.067 0.067 -----***--- 0.0 NaN NaN NaN NaN NaN NaN NaN
EDIT:
You may also need
merge_df.index = merge_df.index + 1
to correct the index, so the rows are numbered starting from 1 as in the source file.
Related
Pivot table reindexing in pandas
Having a dataframe as below:
df1 = pd.DataFrame({'Name1':['A','Q','A','B','B','C','C','C','E','E','E'],
                    'Name2':['B','C','D','C','D','D','A','B','A','B','C'],
                    'Marks2':[10,20,6,50, 88,23,140,9,60,65,70]})
df1
#created a new frame
new = df1.loc[(df1['Marks2'] <= 50)]
new
#created a pivot table
temp = new.pivot_table(index="Name1", columns="Name2", values="Marks2")
temp
I tried to re-index the pivot table.
new_value = ['E']
order = new_value + list(temp.index.difference(new_value))
matrix = temp.reindex(index=order, columns=order)
matrix
But the values related to 'E' are not present in the pivot table, although dataframe df1 does contain values related to 'E'. I need to add the values related to 'E' to the pivot table.
Expected output:
Based on the comments, my understanding of the intended result:
     E     A     B     C     D
E  NaN  60.0  65.0  70.0   NaN
A  NaN   NaN  10.0   NaN   6.0
C  NaN   NaN   9.0   NaN  23.0
Q  NaN   NaN   NaN  20.0   NaN
Code:
Activate the included #print() statements to see what each step does. Especially the header 'formatting' at the end you may adapt according to your needs.
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Name1':['A','Q','A','B','B','C','C','C','E','E','E'],
                    'Name2':['B','C','D','C','D','D','A','B','A','B','C'],
                    'Marks2':[10,20,6,50, 88,23,140,9,60,65,70]})

df1['Marks2'] = np.where((df1['Marks2'] >= 50) & (df1['Name1'] != 'E'), np.nan, df1['Marks2'])
#print(df1)

temp = df1.pivot_table(index="Name1", columns="Name2", values="Marks2")
#print(temp)

name1_to_move = 'E'

# build new index with name1_to_move at the start (top in df idx)
idx = temp.index.tolist()
idx.pop(idx.index(name1_to_move))
idx.insert(0, name1_to_move)

# moving the row to top by reindex
temp = temp.reindex(idx)
#print(temp)

temp.insert(loc=0, column=name1_to_move, value=np.nan)
#print(temp)

# header 'formatting'
temp.index.name = None
#print(temp)
temp = temp.rename_axis(None, axis=1)
print(temp)
insert missing rows in df with dictionary values
Hello, I have the following dataframe:
df = pd.DataFrame(data={'grade_1':['A','B','C'],
                        'grade_1_count': [19,28,32],
                        'grade_2': ['pass','fail',np.nan],
                        'grade_2_count': [39,18, np.nan]})
Some grades are missing and need to be inserted into the grade_n column according to the values in this dictionary:
grade_dict = {'grade_1':['A','B','C','D','E','F'],
              'grade_2' : ['pass','fail','not present', 'borderline']}
The corresponding row value in the _count column should be filled with np.nan, so the expected output is like this:
expected_df = pd.DataFrame(data={'grade_1':['A','B','C','D','E','F'],
                                 'grade_1_count': [19,28,32,0,0,0],
                                 'grade_2': ['pass','fail','not preset','borderline', np.nan, np.nan],
                                 'grade_2_count': [39,18,0,0,np.nan,np.nan]})
So far I have this rather inelegant code that creates a column including all the correct grade categories, but I cannot reinsert it into the dataframe or fill the count columns with zeros (the np.nans just reflect empty cells due to coercing columns with different lengths of rows). I hope that makes sense; any advice would be great. Thanks.
x = []
for k, v in grade_dict.items():
    out = df[k].reindex(grade_dict[k], axis=0, fill_value=0)
    x = pd.concat([out], axis=1)
    x[k] = x.index
    x = x.reset_index(drop=True)
    df[k] = x.fillna(np.nan)
Here is a solution using two consecutive merges:
# set up combinations
from itertools import zip_longest
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)

# merge
(df2.merge(df.filter(like='grade_1'), on='grade_1', how='left')
    .merge(df.filter(like='grade_2'), on='grade_2', how='left')
    .sort_index(axis=1)
)
output:
  grade_1  grade_1_count      grade_2  grade_2_count
0       A           19.0         pass           39.0
1       B           28.0         fail           18.0
2       C           32.0  not present            NaN
3       D            NaN   borderline            NaN
4       E            NaN         None            NaN
5       F            NaN         None            NaN
multiple merges:
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)
for col in grade_dict:
    df2 = df2.merge(df.filter(like=col), on=col, how='left')
df2
If you only need to merge on grade_1 without updating the non-NaNs of grade_2, you can cast grade_dict into a df and then use combine_first:
print (df.set_index("grade_1")
         .combine_first(pd.DataFrame(grade_dict.values(), index=grade_dict.keys()).T.set_index("grade_1"))
         .fillna({"grade_1_count": 0})
         .reset_index())

  grade_1  grade_1_count      grade_2  grade_2_count
0       A           19.0         pass           39.0
1       B           28.0         fail           18.0
2       C           32.0  not present            NaN
3       D            0.0   borderline            NaN
4       E            0.0         None            NaN
5       F            0.0         None            NaN
Concat two Pandas DataFrame column with different length of index
How do I merge columns of one Pandas dataframe into another dataframe when the new columns of data have fewer rows? Specifically, I need the new column of data to be filled with NaN in the first few rows of the merged DataFrame instead of the last few rows. Please refer to the picture. Thanks.
Use:
df1 = pd.DataFrame({
    'A':list('abcdef'),
    'B':[4,5,4,5,5,4],
})

df2 = pd.DataFrame({
    'SMA':list('rty')
})

df3 = df1.join(df2.set_index(df1.index[-len(df2):]))
Or:
df3 = pd.concat([df1, df2.set_index(df1.index[-len(df2):])], axis=1)

print (df3)
   A  B  SMA
0  a  4  NaN
1  b  5  NaN
2  c  4  NaN
3  d  5    r
4  e  5    t
5  f  4    y
How it works: first, the last len(df2) labels of df1's index are selected:
print (df1.index[-len(df2):])
RangeIndex(start=3, stop=6, step=1)
Then the existing index of df2 is overwritten with them by DataFrame.set_index:
print (df2.set_index(df1.index[-len(df2):]))
  SMA
3   r
4   t
5   y
Merge 3 or more dataframes
I'm trying to merge 3 dataframes by index, however so far unsuccessfully. Here is the code:
import pandas as pd
from functools import reduce

# identifying csvs
x = '/home/'
csvpaths = ("Data1.csv", "Data2.csv", "Data3.csv")
dfs = list()  # an empty list

# creating dataframes based on number of csvs
for i in range(len(csvpaths)):
    dfs.append(pd.read_csv(str(x) + csvpaths[i], index_col=0))
print(dfs[1])

# creating suffix for each dataframe's columns
S = []
for y in csvpaths:
    s = str(y).split('.csv')[0]
    S.append(s)
print(S)

# merging attempt
dfx = lambda a,b: pd.merge(a, b, on='SHIP_ID', suffixes=(S)), dfs
print(dfx)
print(dfx.columns)
If I try to export it as csv I get the following error (and a similar error when I try to print dfx.columns):
'tuple' object has no attribute 'to_csv'
The output I want is a merger of the 3 dataframes as follows (with the respective suffixes); please help.
[Note: the table below is very simplified; the original table consists of dozens of columns and thousands of rows, hence a practical merging method is required]
Try:
for s, el in zip(suffixes, dfs):
    el.columns = [str(col)+s for col in el.columns]
dfx = pd.concat(dfs, ignore_index=True, sort=False, axis=1)
For the test case I used:
import pandas as pd

dfs = [pd.DataFrame({"x": [1,2,7], "y": list("ghi")}),
       pd.DataFrame({"x": [5,6], "z": [4,4]}),
       pd.DataFrame({"x": list("acgjksd")})]

suffixes = ["_1", "_2", "_3"]

for s, el in zip(suffixes, dfs):
    el.columns = [str(col)+s for col in el.columns]

>>> pd.concat(dfs, ignore_index=True, sort=False, axis=1)
   x_1  y_1  x_2  z_2 x_3
0  1.0    g  5.0  4.0   a
1  2.0    h  6.0  4.0   c
2  7.0    i  NaN  NaN   g
3  NaN  NaN  NaN  NaN   j
4  NaN  NaN  NaN  NaN   k
5  NaN  NaN  NaN  NaN   s
6  NaN  NaN  NaN  NaN   d
Edit:
for s, el in zip(suffixes, dfs):
    el.columns = [str(col)+s for col in el.columns]
    el.set_index('ID', inplace=True)
dfx = pd.concat(dfs, ignore_index=False, sort=False, axis=1).reset_index()
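As a sketch of an alternative (not part of the answer above): since the question already imports functools.reduce, the list of frames can also be folded into a single outer join on the index. The SHIP_ID-indexed toy frames below are hypothetical stand-ins for the three CSVs.
import pandas as pd
from functools import reduce

# hypothetical toy frames standing in for Data1/Data2/Data3, each indexed by SHIP_ID
dfs = [pd.DataFrame({"speed": [10, 12]}, index=pd.Index(["s1", "s2"], name="SHIP_ID")),
       pd.DataFrame({"draft": [5, 7]}, index=pd.Index(["s1", "s3"], name="SHIP_ID")),
       pd.DataFrame({"flag": ["DE", "NL"]}, index=pd.Index(["s2", "s3"], name="SHIP_ID"))]

# outer-join the frames pairwise on their index
merged = reduce(lambda left, right: left.join(right, how="outer"), dfs)
print(merged)
Because the column names here are already unique, no suffixes are needed; with overlapping column names you would rename the columns first, as in the answer above.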
Populating pandas dataframe efficiently using a 2-D numpy array
I have a 2-D numpy array, each row of which consists of three elements: ['dataframe_column_name', 'dataframe_index', 'value']. I tried populating the pandas dataframe using iloc in a double for loop, but it is quite slow. Is there a faster way of doing this? I am a bit new to pandas, so apologies in case this is something very basic. Here is the code snippet:
my_nparray = [['a', 1, 123], ['b', 1, 230], ['a', 2, 321]]
for r in range(my_nparray.shape[0]):
    [col, ind, value] = my_nparray[r]
    df.iloc[col][ind] = value
This takes a lot of time when my_nparray is large; is there any other way of doing this?
Initially, assume that I can create this data frame:
   'a'  'b'
1  NaN  NaN
2  NaN  NaN
I want the output as:
   'a'  'b'
1  123  230
2  321  NaN
You can use from_records and then pivot:
df = pd.DataFrame.from_records(my_nparray, index=1).pivot(columns=0)

       2
0      a      b
1
1  123.0  230.0
2  321.0    NaN
This specifies that the index uses field 1 from your array and pivot uses Series 0 for the columns. Then we can reset the MultiIndex on the columns and the index:
df.columns = df.columns.droplevel(None)
df.columns.name = None
df.index.name = None

       a      b
1  123.0  230.0
2  321.0    NaN
Use the DataFrame constructor with DataFrame.pivot and DataFrame.rename_axis:
df = pd.DataFrame(my_nparray).pivot(1,0,2).rename_axis(index=None, columns=None)
print (df)
       a      b
1  123.0  230.0
2  321.0    NaN
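A hedged follow-up, not from the original answer: on recent pandas releases DataFrame.pivot only accepts keyword arguments, so the positional call above may raise a TypeError there. The keyword form should be equivalent:
# same pivot spelled with keyword arguments (required on newer pandas versions)
df = (pd.DataFrame(my_nparray)
        .pivot(index=1, columns=0, values=2)
        .rename_axis(index=None, columns=None))
print(df)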