Merge two data frames - python
I tried to merge two data frames by appending the first row of the second df to the first row of the first df. I also tried to concatenate them, but either way failed.
The format of the data is
1,3,N0128,Durchm.,5.0,0.1,5.0760000000000005,0.076,-----****--
2,0.000,,,,,,,
3,3,N0129,Position,62.2,0.376,62.238,0.136,***---
4,76.1,-36.000,0.300,-36.057,,,,
5,2,N0130,Durchm.,5.0,0.1,5.067,0.067,-----***---
6,0.000,,,,,,,
The expected format of the output should be
1,3,N0128,Durchm.,5.0,0.1,5.0760000000000005,0.076,-----****--,0.000,,,,,,,
2,3,N0129,Position,62.2,0.376,62.238,0.136,***---**,76.1,-36.000,0.300,-36.057,,,,
3,N0130,Durchm.,5.0,0.1,5.067,0.067,-----***---,0.000,,,,,,,
I already split the dataframe above into two frames. The first one contains only the odd indexes and the second one only the even ones.
My problem now is to merge/concatenate the two frames by appending the first row of the second df to the first row of the first df. I already tried some merging/concatenating methods, but all of them failed. The print calls are not necessary; I only use them to get a quick overview in the console.
The code I felt most comfortable with is:
os.chdir(output)
csv_files = os.listdir('.')
for csv_file in csv_files:
    if csv_file.endswith(".asc.csv"):
        df = pd.read_csv(csv_file)
        keep_col = ['Messpunkt', 'Zeichnungspunkt', 'Eigenschaft', 'Position', 'Sollmass', 'Toleranz', 'Abweichung', 'Lage']
        new_df = df[keep_col]
        new_df = new_df[~new_df['Messpunkt'].isin(['**Teil'])]
        new_df = new_df[~new_df['Messpunkt'].isin(['**KS-Oben'])]
        new_df = new_df[~new_df['Messpunkt'].isin(['**KS-Unten'])]
        new_df = new_df[~new_df['Messpunkt'].isin(['**N'])]
        print(new_df)
        new_df.to_csv(output + csv_file)
        df1 = new_df[new_df.index % 2 == 1]
        df2 = new_df[new_df.index % 2 == 0]
        df1.reset_index()
        df2.reset_index()
        print(df1)
        print(df2)
        merge_df = pd.concat([df1, df2], axis=1)
        print(merge_df)
        merge_df.to_csv(output + csv_file)
I would highly appreciate some help.
With this code, the output is:
1,3,N0128,Durchm.,5.0,0.1,5.0760000000000005,0.076,-----****--,,,,,,,,
2,,,,,,,,,0.000,,,,,,,
3,3,N0129,Position,62.2,0.376,62.238,0.136,***---,,,,,,,,
4,,,,,,,,,76.1,-36.000,0.300,-36.057,,,,
5,2,N0130,Durchm.,5.0,0.1,5.067,0.067,-----***---,,,,,,,,
6,,,,,,,,,0.000,,,,,,,
I get the expected result when I use reset_index() to give both DataFrames the same index.
You may also need drop=True so the old index is not added as a new column:
pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
Minimal working example.
I use io only to simulate a file in memory.
text = '''1,3,N0128,Durchm.,5.0,0.1,5.0760000000000005,0.076,-----****--
2,0.000,,,,,,,
3,3,N0129,Position,62.2,0.376,62.238,0.136,***---
4,76.1,-36.000,0.300,-36.057,,,,
5,2,N0130,Durchm.,5.0,0.1,5.067,0.067,-----***---
6,0.000,,,,,,,'''
import pandas as pd
import io
pd.options.display.max_columns = 20 # to display all columns
df = pd.read_csv(io.StringIO(text), header=None, index_col=0)
#print(df)
df1 = df[df.index % 2 == 1] # .reset_index(drop=True)
df2 = df[df.index % 2 == 0] # .reset_index(drop=True)
#print(df1)
#print(df2)
merge_df = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
print(merge_df)
Result:
1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
0 3.0 N0128 Durchm. 5.0 0.100 5.076 0.076 -----****-- 0.0 NaN NaN NaN NaN NaN NaN NaN
1 3.0 N0129 Position 62.2 0.376 62.238 0.136 ***--- 76.1 -36.000 0.300 -36.057 NaN NaN NaN NaN
2 2.0 N0130 Durchm. 5.0 0.100 5.067 0.067 -----***--- 0.0 NaN NaN NaN NaN NaN NaN NaN
EDIT:
You may also need
merge_df.index = merge_df.index + 1
to correct the index, so the rows are numbered starting from 1 as in the source file.
Related
Pivot table reindexing in pandas
Having a dataframe as below:
df1 = pd.DataFrame({'Name1':['A','Q','A','B','B','C','C','C','E','E','E'],
                    'Name2':['B','C','D','C','D','D','A','B','A','B','C'],
                    'Marks2':[10,20,6,50, 88,23,140,9,60,65,70]})
df1
#created a new frame
new = df1.loc[(df1['Marks2'] <= 50)]
new
#created a pivot table
temp = new.pivot_table(index="Name1", columns="Name2", values="Marks2")
temp
I tried to re-index the pivot table.
new_value = ['E']
order = new_value + list(temp.index.difference(new_value))
matrix = temp.reindex(index=order, columns=order)
matrix
But the values related to 'E' are not present in the pivot table, although dataframe df1 does contain values related to 'E'. I need to add the values related to 'E' to the pivot table.
Expected output:
Based on the comments, my understanding of the intended result:
     E     A     B     C     D
E  NaN  60.0  65.0  70.0   NaN
A  NaN   NaN  10.0   NaN   6.0
C  NaN   NaN   9.0   NaN  23.0
Q  NaN   NaN   NaN  20.0   NaN
Code:
Activate the included #print() statements to see what each step does. Especially the header 'formatting' at the end you may adapt according to your needs.
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Name1':['A','Q','A','B','B','C','C','C','E','E','E'],
                    'Name2':['B','C','D','C','D','D','A','B','A','B','C'],
                    'Marks2':[10,20,6,50, 88,23,140,9,60,65,70]})

df1['Marks2'] = np.where((df1['Marks2'] >= 50) & (df1['Name1'] != 'E'), np.nan, df1['Marks2'])
#print(df1)

temp = df1.pivot_table(index="Name1", columns="Name2", values="Marks2")
#print(temp)

name1_to_move = 'E'

# build new index with name1_to_move at the start (top in df idx)
idx = temp.index.tolist()
idx.pop(idx.index(name1_to_move))
idx.insert(0, name1_to_move)

# moving the row to top by reindex
temp = temp.reindex(idx)
#print(temp)

temp.insert(loc=0, column=name1_to_move, value=np.nan)
#print(temp)

# header 'formatting'
temp.index.name = None
#print(temp)
temp = temp.rename_axis(None, axis=1)
print(temp)
insert missing rows in df with dictionary values
Hello, I have the following dataframe:
df = pd.DataFrame(data={'grade_1':['A','B','C'],
                        'grade_1_count': [19,28,32],
                        'grade_2': ['pass','fail',np.nan],
                        'grade_2_count': [39,18, np.nan]})
Some grades are missing and need to be inserted into the grade_n column according to the values in this dictionary:
grade_dict = {'grade_1':['A','B','C','D','E','F'],
              'grade_2' : ['pass','fail','not present', 'borderline']}
The corresponding row value in the _count column should be filled with np.nan, so the expected output is like this:
expected_df = pd.DataFrame(data={'grade_1':['A','B','C','D','E','F'],
                                 'grade_1_count': [19,28,32,0,0,0],
                                 'grade_2': ['pass','fail','not preset','borderline', np.nan, np.nan],
                                 'grade_2_count': [39,18,0,0,np.nan,np.nan]})
So far I have this rather inelegant code that creates a column including all the correct grade categories, but I cannot reinsert it into the dataframe or fill the count columns with zeros (the np.nans just reflect empty cells due to coercing columns with different lengths of rows). I hope that makes sense; any advice would be great. Thanks.
x = []
for k, v in grade_dict.items():
    out = df[k].reindex(grade_dict[k], axis=0, fill_value=0)
    x = pd.concat([out], axis=1)
    x[k] = x.index
    x = x.reset_index(drop=True)
    df[k] = x.fillna(np.nan)
Here is a solution using two consecutive merges:
# set up combinations
from itertools import zip_longest
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)

# merge
(df2.merge(df.filter(like='grade_1'), on='grade_1', how='left')
    .merge(df.filter(like='grade_2'), on='grade_2', how='left')
    .sort_index(axis=1)
)
output:
  grade_1  grade_1_count      grade_2  grade_2_count
0       A           19.0         pass           39.0
1       B           28.0         fail           18.0
2       C           32.0  not present            NaN
3       D            NaN   borderline            NaN
4       E            NaN         None            NaN
5       F            NaN         None            NaN
multiple merges:
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)
for col in grade_dict:
    df2 = df2.merge(df.filter(like=col), on=col, how='left')
df2
If you only need to merge on grade_1 without updating the non-NaNs of grade_2, you can cast grade_dict into a df and then use combine_first:
print (df.set_index("grade_1")
         .combine_first(pd.DataFrame(grade_dict.values(), index=grade_dict.keys()).T.set_index("grade_1"))
         .fillna({"grade_1_count": 0})
         .reset_index())

  grade_1  grade_1_count      grade_2  grade_2_count
0       A           19.0         pass           39.0
1       B           28.0         fail           18.0
2       C           32.0  not present            NaN
3       D            0.0   borderline            NaN
4       E            0.0         None            NaN
5       F            0.0         None            NaN
Concat two Pandas DataFrame column with different length of index
How do I merge columns of one Pandas dataframe into another dataframe when the new columns of data have fewer rows? Specifically, I need the new column of data to be filled with NaN in the first few rows of the merged DataFrame instead of the last few rows. Please refer to the picture. Thanks.
Use:
df1 = pd.DataFrame({
    'A':list('abcdef'),
    'B':[4,5,4,5,5,4],
})

df2 = pd.DataFrame({
    'SMA':list('rty')
})

df3 = df1.join(df2.set_index(df1.index[-len(df2):]))
Or:
df3 = pd.concat([df1, df2.set_index(df1.index[-len(df2):])], axis=1)

print (df3)
   A  B  SMA
0  a  4  NaN
1  b  5  NaN
2  c  4  NaN
3  d  5    r
4  e  5    t
5  f  4    y
How it works: first, the last len(df2) labels of df1's index are selected:
print (df1.index[-len(df2):])
RangeIndex(start=3, stop=6, step=1)
Then the existing index of df2 is overwritten with them by DataFrame.set_index:
print (df2.set_index(df1.index[-len(df2):]))
  SMA
3   r
4   t
5   y
Merge 3 or more dataframes
I'm trying to merge 3 dataframes by index, however so far unsuccessfully. Here is the code:
import pandas as pd
from functools import reduce

# identifying csvs
x = '/home/'
csvpaths = ("Data1.csv", "Data2.csv", "Data3.csv")
dfs = list()  # an empty list

# creating dataframes based on number of csvs
for i in range(len(csvpaths)):
    dfs.append(pd.read_csv(str(x) + csvpaths[i], index_col=0))
print(dfs[1])

# creating suffix for each dataframe's columns
S = []
for y in csvpaths:
    s = str(y).split('.csv')[0]
    S.append(s)
print(S)

# merging attempt
dfx = lambda a,b: pd.merge(a, b, on='SHIP_ID', suffixes=(S)), dfs
print(dfx)
print(dfx.columns)
If I try to export it as csv I get the following error (and a similar error when I try to print dfx.columns):
'tuple' object has no attribute 'to_csv'
The output I want is a merger of the 3 dataframes as follows (with the respective suffixes); please help.
[Note: the table below is very simplified; the original table consists of dozens of columns and thousands of rows, hence a practical merging method is required]
Try:
for s, el in zip(suffixes, dfs):
    el.columns = [str(col)+s for col in el.columns]
dfx = pd.concat(dfs, ignore_index=True, sort=False, axis=1)
For the test case I used:
import pandas as pd

dfs = [pd.DataFrame({"x": [1,2,7], "y": list("ghi")}),
       pd.DataFrame({"x": [5,6], "z": [4,4]}),
       pd.DataFrame({"x": list("acgjksd")})]

suffixes = ["_1", "_2", "_3"]

for s, el in zip(suffixes, dfs):
    el.columns = [str(col)+s for col in el.columns]

>>> pd.concat(dfs, ignore_index=True, sort=False, axis=1)
   x_1  y_1  x_2  z_2 x_3
0  1.0    g  5.0  4.0   a
1  2.0    h  6.0  4.0   c
2  7.0    i  NaN  NaN   g
3  NaN  NaN  NaN  NaN   j
4  NaN  NaN  NaN  NaN   k
5  NaN  NaN  NaN  NaN   s
6  NaN  NaN  NaN  NaN   d
Edit:
for s, el in zip(suffixes, dfs):
    el.columns = [str(col)+s for col in el.columns]
    el.set_index('ID', inplace=True)
dfx = pd.concat(dfs, ignore_index=False, sort=False, axis=1).reset_index()
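As a sketch of an alternative (not part of the answer above): since the question already imports functools.reduce, the list of frames can also be folded into a single outer join on the index. The SHIP_ID-indexed toy frames below are hypothetical stand-ins for the three CSVs.
import pandas as pd
from functools import reduce

# hypothetical toy frames standing in for Data1/Data2/Data3, each indexed by SHIP_ID
dfs = [pd.DataFrame({"speed": [10, 12]}, index=pd.Index(["s1", "s2"], name="SHIP_ID")),
       pd.DataFrame({"draft": [5, 7]}, index=pd.Index(["s1", "s3"], name="SHIP_ID")),
       pd.DataFrame({"flag": ["DE", "NL"]}, index=pd.Index(["s2", "s3"], name="SHIP_ID"))]

# outer-join the frames pairwise on their index
merged = reduce(lambda left, right: left.join(right, how="outer"), dfs)
print(merged)
Because the column names here are already unique, no suffixes are needed; with overlapping column names you would rename the columns first, as in the answer above.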
Populating pandas dataframe efficiently using a 2-D numpy array
I have a 2-D numpy array, each row of which consists of three elements: ['dataframe_column_name', 'dataframe_index', 'value']. I tried populating the pandas dataframe using iloc in a double for loop, but it is quite slow. Is there a faster way of doing this? I am a bit new to pandas, so apologies in case this is something very basic. Here is the code snippet:
my_nparray = [['a', 1, 123], ['b', 1, 230], ['a', 2, 321]]
for r in range(my_nparray.shape[0]):
    [col, ind, value] = my_nparray[r]
    df.iloc[col][ind] = value
This takes a lot of time when my_nparray is large; is there any other way of doing this?
Initially, assume that I can create this data frame:
   'a'  'b'
1  NaN  NaN
2  NaN  NaN
I want the output as:
   'a'  'b'
1  123  230
2  321  NaN
You can use from_records and then pivot:
df = pd.DataFrame.from_records(my_nparray, index=1).pivot(columns=0)

       2
0      a      b
1
1  123.0  230.0
2  321.0    NaN
This specifies that the index uses field 1 from your array and pivot uses Series 0 for the columns. Then we can reset the MultiIndex on the columns and the index:
df.columns = df.columns.droplevel(None)
df.columns.name = None
df.index.name = None

       a      b
1  123.0  230.0
2  321.0    NaN
Use the DataFrame constructor with DataFrame.pivot and DataFrame.rename_axis:
df = pd.DataFrame(my_nparray).pivot(1,0,2).rename_axis(index=None, columns=None)
print (df)
       a      b
1  123.0  230.0
2  321.0    NaN
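A hedged follow-up, not from the original answer: on recent pandas releases DataFrame.pivot only accepts keyword arguments, so the positional call above may raise a TypeError there. The keyword form should be equivalent:
# same pivot spelled with keyword arguments (required on newer pandas versions)
df = (pd.DataFrame(my_nparray)
        .pivot(index=1, columns=0, values=2)
        .rename_axis(index=None, columns=None))
print(df)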