I have an issue writing to Excel after merging.
Outfile1 = r'k:\dir1\outfile1.xlsx'
DF0 = ['A', 'B', 'C']  # columns of the first DataFrame
DF1 = ['A', 'B', 'D']  # columns of the second DataFrame
DF2 = DF0.merge(DF1, on=['A', 'B'])
DF2.to_excel(Outfile1, engine='xlsxwriter')
The Excel file has the following columns:
'A' 'B' 'A' 'B' 'C' 'D', and the second 'A' and 'B' are blank.
What am I doing wrong? I only want 'A', 'B', 'C', 'D' in the spreadsheet.
This should do it:
import pandas as pd
# sample data
data = {'letters': ['a', 'e', 'c', 'g', 'h', 'b']}
data1 = {'letters': ['a', 'd', 'b', 'e', 'f']}
# data to DataFrames
df0 = pd.DataFrame(data)
df1 = pd.DataFrame(data1)
# merge
df2 = df0.merge(df1, how='outer')
Now they are merged without duplicates, but out of order. Use sort_values to correct this:
df2 = df2.sort_values(['letters'])
print(df2)
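For reference, the sorted result should look like this (the index values reflect the order the rows came out of the merge):
  letters
0       a
5       b
2       c
6       d
1       e
7       f
3       g
4       h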
Conclusion: after a merge, it is recommended to reset the index; otherwise reindex can sometimes create extra columns in the output.
Step 1: dataset
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[7, 7, 7], [1, 2, 8]], columns=['A', 'B', 'D'])
df_merge = df1.merge(df2, on=['A', 'B'])
Step 2: reset_index before reordering the DataFrame through reindex.
my_col_order = ['D', 'B', 'C']
df_merge.reset_index(inplace=True)  # after a merge, resetting the index is recommended;
# otherwise reindex can sometimes create extra columns in the output
df_5 = df_merge.reindex(my_col_order, axis='columns')
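Printing df_5 then shows only the requested columns, in the requested order:
   D  B  C
0  8  2  3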
My code is:
dfs = [df1,df2,df3]
le = dfs[0].drop_duplicates(subset=['id'])
df = dfs[1].set_index('id')
df.update(le.set_index('id'))
df.reset_index(inplace=True)
le1 = df.drop_duplicates(subset=['id'])
df1 = dfs[2].set_index('id')
df1.update(le1.set_index('id'))
df1.reset_index(inplace=True)
My final output is:
myfinalupdateddf = df1
How can I rewrite the above in a dynamic way, using a for loop instead of creating multiple variables, to get my final output?
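One possible way to express the same chain as a loop (a sketch, assuming every frame in dfs carries an 'id' column, as in the code above):
dfs = [df1, df2, df3]
current = dfs[0]
for nxt in dfs[1:]:
    # de-duplicate what has been accumulated so far, then push its values into the next frame
    le = current.drop_duplicates(subset=['id'])
    updated = nxt.set_index('id')
    updated.update(le.set_index('id'))
    updated.reset_index(inplace=True)
    current = updated
myfinalupdateddf = current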
I have a dictionary where each key is a file name and each value is a DataFrame that looks like:
col1 col2
A 10
B 20
A 20
A 10
B 10
I want to group by 'col1', sum the values in 'col2', and store the result in a new DataFrame df. The output should look like:
Index A B
file1 40 30
file2 50 35
My code:
df = pd.DataFrame(columns=['A', 'B'])
for key, value in data.items():
    cnt = value.groupby('col1')['col2'].sum()
    print(cnt)
    df.append(cnt, ignore_index=True)
Another suggested way: group by, transpose, and row-stack into a DataFrame.
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({'col1':['A', 'B', 'A', 'A', 'B'], 'col2':[10, 20, 20, 10, 10]})
df_2 = pd.DataFrame({'col1':['A', 'B', 'A', 'A', 'B'], 'col2':[30, 10, 15, 5, 25]})
df_1_agg = df_1.groupby(['col1']).agg({'col2':'sum'}).T.values
df_2_agg = df_2.groupby(['col1']).agg({'col2':'sum'}).T.values
pd.DataFrame(np.row_stack((df_1_agg, df_2_agg)), index = ['file1', 'file2']).rename(columns = {0:'A', 1:'B'})
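This reproduces the desired table:
        A   B
file1  40  30
file2  50  35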
Edit: to generalize, you need to put the aggregation into a loop. You also need to build the index (file{i}) for the general case:
lst_df = [df_1, df_2]
df_all = []
for i in lst_df:
    # aggregate each DataFrame
    df_agg = i.groupby(['col1']).agg({'col2': 'sum'}).T.values
    # append to the accumulator
    df_all.append(df_agg)
pd.DataFrame(np.row_stack(df_all),
             index=[f'file{i+1}' for i in range(len(df_all))]).rename(columns={0: 'A', 1: 'B'})
You should try to avoid appending in a loop. This is inefficient and not recommended.
Instead, you can concatenate your dataframes into one large dataframe, then use pivot_table:
# aggregate values in your dictionary, adding a "file" series
df_comb = pd.concat((v.assign(file=k) for k, v in data.items()), ignore_index=True)
# perform 'sum' aggregation, specifying index, columns & values
df = df_comb.pivot_table(index='file', columns='col1', values='col2', aggfunc='sum')
Explanation
v.assign(file=k) adds a file column to each DataFrame, with its value set to the filename.
pd.concat concatenates all the dataframes in your dictionary.
pd.DataFrame.pivot_table is a Pandas method which allows you to create Excel-style pivot tables via specifying index, columns, values and aggfunc (aggregation function).
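As a quick check with the two sample frames from the previous answer (assuming the dictionary is keyed by file name):
data = {'file1': df_1, 'file2': df_2}
df_comb = pd.concat((v.assign(file=k) for k, v in data.items()), ignore_index=True)
df = df_comb.pivot_table(index='file', columns='col1', values='col2', aggfunc='sum')
print(df)
# col1    A   B
# file
# file1  40  30
# file2  50  35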
I am looking to change part of the string in a column of a DataFrame. However, I cannot get it to update in the DataFrame. This is my code.
import pandas as pd
# File path
csv = '/home/test.csv'
# Read the csv into pandas
df = pd.read_csv(csv, header=None, names=['A', 'B', 'C', 'D', 'E', 'F'])
# Select the data to update
paths = df['A']
# Loop over the data
for x in paths:
    # Select the part to update
    old = x[:36]
    # Update value
    new = '/Datasets/RetinaNetData'
    # Replace
    new_path = x.replace(old, new)
    # Save values to the DataFrame
    paths.update(new_path)
# Print the updated DataFrame
print(df)
The inputs and output I would like are:
Input:
/Annotations/test_folder/10_m03293_ORG.png
/Annotations/test_folder/10_m03293_ORG.png
/Annotations/test_folder/10_m03293_ORG.png
/Annotations/test_folder/10_m03293_ORG.png
OutPut:
/Datasets/RetinaNetData/10_m03293_ORG.png
/Datasets/RetinaNetData/10_m03293_ORG.png
/Datasets/RetinaNetData/10_m03293_ORG.png
/Datasets/RetinaNetData/10_m03293_ORG.png
Assuming that all of the rows are strings and all of them have at least 36 characters, you can use .str to get the part of the cells after the 36th character. Then you can just use the + operator to combine the new beginning with the remainder of each cell's contents:
df.A = '/Datasets/RetinaNetData' + df.A.str[36:]
As a general tip, methods like this that operate across the whole dataframe at once are going to be more efficient than looping over each row individually.
How can I merge several columns into one cell?
How can I convert a CSV file that has 1-by-X cells, where 1 is the row count and X is a column count unknown to the user, into a new CSV file with a single cell that combines all the data from the original file?
Right now (one row, four columns; in fact the number of columns will vary, as the data is extracted from a log file):
1 A B C D
What I want (one row, one column):
1 A
B
C
D
The index of the row may not always be 1, as I have many similar rows like that.
Please refer to the original file and the expected new file at the following URL for details:
https://drive.google.com/drive/folders/1cX0o86nbLAV5Foj5avCK6E2oAvExzt5t
161.csv is the file as it is at present.
161-final.csv is what I want.
The number of rows will not change, but the number of columns is variable, as the data is extracted from a log file. In the end, each row should have only one column.
I am new to pandas. Would it work to count the columns and then merge them into one cell?
I would very much appreciate your help.
code:
import pandas as pd
import numpy as np
df = pd.DataFrame([['a', 'a', 'a', 'a'], ['b', 'b', 'b', 'b']])
print(df)
df1 = np.empty(df.shape[0], dtype=object)
df1[:] = df.values.tolist()
print(df1)
output:
0 1 2 3
0 a a a a
1 b b b b
[list(['a', 'a', 'a', 'a']) list(['b', 'b', 'b', 'b'])]
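If the goal is then a one-column CSV, one option (a sketch; the column name merged is made up, and 161-final.csv is the output name from the question) is to wrap the array in a one-column DataFrame and write it out:
df_out = pd.DataFrame({'merged': df1})
df_out.to_csv('161-final.csv', index=False)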
Not sure this is what you want, but you can manage to put the content of one row into a single column with groupby():
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(4).reshape(1, 4))
df.columns = ['A', 'B', 'C', 'D']
df = df.groupby(df.index).apply(lambda x: x.values.ravel())
df = df.to_frame('unique_col')  # if needed
Output :
unique_col
0 [0, 1, 2, 3]
I am not sure it is possible to get the output as anything other than a list, as shown in your example.
I used a "stupid" way to do that: pd.read_table(file) does the magic.
import pandas as pd

def CsvMerge(file1, file2, output):
    ## Preparation for getFormated()
    ## Merges "Format.csv" and the "text.csv"
    df1 = pd.read_csv(file1)
    ### read_table merges the columns into one column
    ### and also avoids a csv parsing error
    df2 = pd.read_table(file2)
    df3 = pd.concat([df1, df2], axis=1)
    print(df3)
    with open(output, 'w+', encoding="utf-8") as f:
        df3.to_csv(f, index=False)
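Usage would look something like this (the file names here are placeholders):
CsvMerge('Format.csv', '161.csv', '161-final.csv')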
Using Python, I want to count the number of cells in each row of a pandas DataFrame that have data in them, and record the count in the leftmost cell of the row.
To count the number of cells missing data in each row, you probably want to do something like this:
df.apply(lambda x: x.isnull().sum(), axis='columns')
Replace df with the label of your data frame.
You can create a new column and write the count to it using something like:
df['MISSING'] = df.apply(lambda x: x.isnull().sum(), axis='columns')
The column will be created at the end (rightmost) of your data frame.
You can move your columns around like this:
df = df[['Count', 'M', 'A', 'B', 'C']]
Update
I'm wondering if your missing cells are actually empty strings as opposed to NaN values. Can you confirm? I copied your screenshot into an Excel workbook. My full code is below:
df = pd.read_excel('count.xlsx', na_values=['', ' '])
df.head() # You should see NaN for empty cells
df['M'] = df.apply(lambda x: x.isnull().sum(), axis='columns')
df.head() # Column M should report the values: first row: 0, second row: 1, third row: 2
df = df[['Count', 'M', 'A', 'B', 'C']]
df.head() # Column order should be Count, M, A, B, C
Notice the na_values parameter in the pd.read_excel method.