I have a data frame in the form of a time series looking like this:
and a second table with additional information about the corresponding columns (names), like this:
Now I want to combine the two, adding specific information from the second table into the header of the first one, with a result like this:
I have the feeling the solution to this is quite trivial, but somehow I just cannot get my head around it. Any help/suggestions/hints on how to approach this?
MWE to create the two tables:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],columns=['a', 'b', 'c'])
df2 = pd.DataFrame([['a','b','c'],['a_desc','b_desc','c_desc'],['a_unit','b_unit','c_unit']]).T
df2.columns=['MSR','OBJDESC','UNIT']
You could get a metadata dict for each of the original column names and then update the columns of the original df:
# store the column metadata you want in the header here
header_metadata = {}

# loop through your second df
for i, row in df2.iterrows():
    # get the column name that this metadata corresponds to
    column_name = row.pop('MSR')
    # drop any metadata you don't want in the header here,
    # e.g. row.pop('SCALE') if your real table has a SCALE column
    # the remaining data in dict(row) goes into the header of the first df
    header_metadata[column_name] = dict(row)

# rename the columns of your first df, joining each name with its metadata values
df.columns = [
    '\n'.join((c, *header_metadata[c].values()))
    for c in df.columns
]
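With the MWE above, the renamed header combines each column name with its description and unit, so a quick check would look like this (expected output, assuming you want the metadata values rather than the keys in the header):
print(df.columns.tolist())
# ['a\na_desc\na_unit', 'b\nb_desc\nb_unit', 'c\nc_desc\nc_unit']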
Related
I have two dataframes that share most of their columns, plus a few distinct columns that appear in only one of them. I want to print out those distinct and common columns so I can get a better idea of which columns changed between the two dataframes. I found some interesting posts on SO, but I don't know why I got an error. The two dataframes have the following shapes:
df19.shape
(39831, 1952)
df20.shape
(39821, 1962)
Here is some dummy data:
import pandas as pd

df1 = pd.DataFrame([[1, 2], [1, 3], [4, 6], [11, 13], [10, 19], [21, 23]], columns=['A', 'B'])
df2 = pd.DataFrame([[3, 4, 0, 7], [1, 3, 9, 2], [4, 6, 3, 8], [8, 5, 1, 6]], columns=['A', 'B', 'C', 'D'])
current attempt
I came across this on SO and tried the following:
res=pd.concat([df19, df20]).loc[df19.index.symmetric_difference(df20.index)]
res.shape
(10, 1984)
This gave me distinct rows but not distinct columns. I also tried this one, but it gave me an error:
df19.compare(df20, keep_equal=True, keep_shape=True)
How should I get the distinct rows and columns when comparing two dataframes in pandas? Does anyone know of a way to do this easily in python? Any quick thoughts? Thanks
objective
I simply want to get the distinct rows or columns when comparing two dataframes, by column name or by row content. For instance, compared to df1, which columns are newly added to df2; similarly, which rows were added to df2, and so on. Any idea?
I would recommend getting the columns by filtering on the column names.
# columns that appear in both dataframes
common = [i for i in list(df1) if i in list(df2)]
temp = df2[common]
# columns that appear only in df2
distinct = [i for i in list(df2) if i not in list(df1)]
temp = df2[distinct]
Thanks to @Shaido, this one worked for me:
import pandas as pd
df1=pd.read_csv(data1)
df2=pd.read_csv(data2)
df1_cols = df1.columns
df2_cols = df2.columns
common_cols = df1_cols.intersection(df2_cols)
df2_not_df1 = df2_cols.difference(df1_cols)
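Applied to the dummy data from the question, a quick check of what those two Index operations return (a sketch using the df1/df2 defined above):
common_cols = df1.columns.intersection(df2.columns)
df2_not_df1 = df2.columns.difference(df1.columns)
print(list(common_cols))   # ['A', 'B']
print(list(df2_not_df1))   # ['C', 'D']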
I have the following pandas dataframe (example input data):
I would like to iterate through the rows of only the first column and, if a row meets a condition (i.e. contains the string 'hello'), add an empty row above it with only the first column populated.
Example output
I have tried:
df.loc[0] ='My new title'
This adds a row, but the title appears in every column.
I'm guessing the syntax would be something like:
for i in df.iloc[0]:
    if i == 'hello':
        df.loc[-1] = 'My new title'
print(df)
This is not working either - could someone point me in the right direction? Thanks
You can add a new row in this way:
import pandas
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pandas.DataFrame(data, columns = ['Name', 'Age'])
df2 = pandas.DataFrame([['My new title', None]], columns = ['Name', 'Age'])
rownumber = 0
for i in df.iloc[0:, 0]:
    if i == "nick":
        # insert the new row just before the matching row
        df = pandas.concat([df.iloc[:rownumber], df2, df.iloc[rownumber:]]).reset_index(drop=True)
        # account for the row that was just inserted
        rownumber += 1
    rownumber += 1
This will add a new row before every "nick".
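For the sample data above, the result should look roughly like this (a sketch of the expected output; the Age value of the inserted row stays empty):
print(df)
#            Name   Age
# 0           tom    10
# 1  My new title  None
# 2          nick    15
# 3          juli    14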
Excuse my being a total novice. I am writing several columns of data to a CSV file where I would like to maintain the headers every time I run the script to write new data to it.
I have successfully appended data to the CSV every time I run the script, but I cannot get the data to write to a new row. It tries to extend the data on the same row. I need it to have a line break.
df = pd.DataFrame([[date, sales_sum, qty_sum, orders_sum, ship_sum]], columns=['Date', 'Sales', 'Quantity', 'Orders', 'Shipping'])
df.to_csv(r'/profit.csv', header=None, index=None, sep=',', mode='a')
I would like the headers to be on the first row "Date, Sales, Quantity, Orders, Shipping"
Second row will display the actual values.
When running the script again, I would like the third row to be appended with the next day's values only. When passing headers it seems it wants to write the headers again, then write the data again below it. I prefer only one set of headers at the top of the CSV. Is this possible?
Thanks in advance.
Not sure if I completely understood what you are trying to do, but checking the documentation it seems there is a header option that can be set to False:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
header : bool or list of str, default True
Write out the column names. If a list of strings is given it is assumed to be
aliases for the column names.
Changed in version 0.24.0: Previously defaulted to False for Series.
Is this what you are looking for?
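Building on that header option, a common pattern (a sketch, not from the original answer) is to write the header only when the file does not exist yet, so repeated runs keep appending rows under a single header:
import os
import pandas as pd

path = r'/profit.csv'
df = pd.DataFrame([[date, sales_sum, qty_sum, orders_sum, ship_sum]],
                  columns=['Date', 'Sales', 'Quantity', 'Orders', 'Shipping'])
# write the header only on the first run, when the file does not exist yet
df.to_csv(path, mode='a', index=False, header=not os.path.exists(path))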
You can define the main dataframe with the columns you want.
Then for each day you create a dataframe of only the new rows and append it to the main dataframe.
Like this:
Main_df = pd.DataFrame(values, columns=columns)
New_rows = pd.DataFrame(new_values, columns=columns)
Main_df = Main_df.append(New_rows, ignore_index=True)
For example:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
print(df)
# A B
#0 1 2
#1 3 4
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df = df.append(df2, ignore_index=True)
print(df)
# A B
#0 1 2
#1 3 4
#2 5 6
#3 7 8
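As an aside (not part of the original answer): DataFrame.append was deprecated and removed in pandas 2.0, so on current versions the same result comes from pd.concat:
df = pd.concat([df, df2], ignore_index=True)  # equivalent to df.append(df2, ignore_index=True)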
I'm trying some operations on an Excel file using pandas. I want to extract some columns from the Excel file, add another column to those extracted columns, and then write all of the columns to a new Excel file. To do this I have to append the new column to the old columns.
Here is my code:
import pandas as pd
#Reading ExcelFIle
#Work.xlsx is input file
ex_file = 'Work.xlsx'
data = pd.read_excel(ex_file,'Data')
#Create subset of columns by extracting columns D,I,J,AU from the file
data_subset_columns = pd.read_excel(ex_file, 'Data', parse_cols="D,I,J,AU")
#Compute new column 'Percentage'
#'Num Labels' and 'Num Tracks' are two different columns in given file
data['Percentage'] = data['Num Labels'] / data['Num Tracks']
data1 = data['Percentage']
print data1
#Here I'm trying to append data['Percentage'] to data_subset_columns
Final_data = data_subset_columns.append(data1)
print Final_data
Final_data.to_excel('111.xlsx')
No error is shown, but Final_data is not giving me the expected results (the data is not getting appended).
There is no need to explicitly append columns in pandas. When you calculate a new column, it is included in the dataframe. When you export it to Excel, the new column will be included.
Try this, assuming 'Num Labels' and 'Num Tracks' are in "D,I,J,AU" [otherwise add them]:
import pandas as pd
data_subset = pd.read_excel(ex_file, 'Data', parse_cols="D,I,J,AU")
data_subset['Percentage'] = data_subset['Num Labels'] / data_subset['Num Tracks']
data_subset.to_excel('111.xlsx')
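One caveat, as an aside: in newer pandas versions the parse_cols argument of read_excel was renamed to usecols, so on a recent install the read would look like:
data_subset = pd.read_excel(ex_file, 'Data', usecols="D,I,J,AU")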
The append function of a dataframe adds rows, not columns, to the dataframe. Well, it does add columns if the appended rows have more columns than the source dataframe.
DataFrame.append(other, ignore_index=False, verify_integrity=False)
Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.
I think you are looking for something like concat.
Combine DataFrame objects horizontally along the x axis by passing in axis=1.
>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
... columns=['letter', 'number'])
>>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
... columns=['animal', 'name'])
>>> pd.concat([df1, df4], axis=1)
  letter  number  animal    name
0      a       1    bird   polly
1      b       2  monkey  george
I would like to know how to add a new row efficiently to the dataframe.
Assuming I have an empty dataframe with columns "A" and "B":
columns = ['A','B']
user_list = pd.DataFrame(columns=columns)
I want to add one row like {A=3, B=4} to the dataframe. How can I do that in the most efficient way?
import numpy as np
import pandas as pd

columns = ['A', 'B']
# pre-allocate a block of nan rows, then fill them in as you go
user_list = pd.DataFrame(np.zeros((1000, 2)) + np.nan, columns=columns)
user_list.iloc[0] = [3, 4]
user_list.iloc[1] = [4, 5]
Pandas doesn't have built-in resizing, but it will ignore nan's pretty well. You'll have to manage your own resizing, though :/
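If you take this approach, a simple way to throw away the unused pre-allocated rows afterwards is something like this (a sketch, assuming rows that are still entirely nan were never filled in):
user_list = user_list.dropna(how='all')  # keep only the rows that were actually filled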