pandas dataframe, how to add new row efficiently - python

I would like to know how to add a new row to a dataframe efficiently.
Assume I have an empty dataframe with the columns "A" and "B":
columns = ['A','B']
user_list = pd.DataFrame(columns=columns)
I want to add one row like {A=3, B=4} to the dataframe. What is the most efficient way to do that?

import numpy as np
import pandas as pd

columns = ['A', 'B']
# preallocate 1000 rows filled with NaN, then assign rows by position
user_list = pd.DataFrame(np.full((1000, 2), np.nan), columns=columns)
user_list.iloc[0] = [3, 4]
user_list.iloc[1] = [4, 5]
Pandas doesn't have built-in resizing, but it handles NaNs pretty well. You'll have to manage your own resizing, though :/
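If the number of rows is not known up front, a different technique from the preallocation above is to collect the rows in a plain Python list and build the dataframe once at the end. This is only a minimal sketch; the list name rows and the two sample rows are assumptions for illustration.

```python
import pandas as pd

# Collect rows as dicts; appending to a list is cheap.
rows = []
rows.append({'A': 3, 'B': 4})
rows.append({'A': 4, 'B': 5})

# Build the DataFrame in a single step once all rows are collected.
user_list = pd.DataFrame(rows, columns=['A', 'B'])
print(user_list)
```

This avoids growing the dataframe one row at a time, which copies the data on every step.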

Related

how to render distinct columns/rows by comparing two dataframes in pandas?

I have two dataframes that share most of their columns, but each has a few distinct columns that appear in only one of them. I want to print the distinct columns and the common columns so I have a better idea of which columns changed between the two dataframes. I found some interesting posts on SO, but I don't know why I get an error. The two dataframes have the following shapes:
df19.shape
(39831, 1952)
df20.shape
(39821, 1962)
here is dummy data:
df1 = pd.DataFrame([[1, 2], [1, 3], [4, 6],[11,13],[10,19],[21,23]], columns=['A', 'B'])
df2 = pd.DataFrame([[3, 4,0,7], [1, 3,9,2], [4, 6,3,8],[8,5,1,6]], columns=['A', 'B','C','D'])
current attempt
I came across SO and tried following:
res=pd.concat([df19, df20]).loc[df19.index.symmetric_difference(df20.index)]
res.shape
(10, 1984)
This gave me the distinct rows but not the distinct columns. I also tried this one, but it gave me an error:
df19.compare(df20, keep_equal=True, keep_shape=True)
How should I render the distinct rows and columns when comparing two dataframes in pandas? Does anyone know an easy way to do this in Python? Thanks
objective
I simply want to render the distinct rows or columns when comparing two dataframes, either by column name or by which rows differ. For instance, compared to df1, which columns were newly added to df2, and similarly which rows were added to df2, and so on. Any idea?
I would recommend selecting the columns by filtering on the column names.
common = [i for i in list(df1) if i in list(df2)]
common_df = df2[common]
distinct = [i for i in list(df2) if i not in list(df1)]
distinct_df = df2[distinct]
Thanks to @Shaido, this one worked for me:
import pandas as pd
df1=pd.read_csv(data1)
df2=pd.read_csv(data2)
df1_cols = df1.columns
df2_cols = df2.columns
common_cols = df1_cols.intersection(df2_cols)
df2_not_df1 = df2_cols.difference(df1_cols)
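As a quick check of the Index.intersection and Index.difference approach, here is a small sketch on data shaped like the dummy frames from the question (shortened to a few rows). The name df1_not_df2 is an addition for illustration and is not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
df2 = pd.DataFrame([[3, 4, 0, 7], [1, 3, 9, 2]], columns=['A', 'B', 'C', 'D'])

# Columns present in both frames.
common_cols = df1.columns.intersection(df2.columns)
# Columns that only df2 has, and columns that only df1 has.
df2_not_df1 = df2.columns.difference(df1.columns)
df1_not_df2 = df1.columns.difference(df2.columns)

print(list(common_cols))   # ['A', 'B']
print(list(df2_not_df1))   # ['C', 'D']
print(list(df1_not_df2))   # []
```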

Creating multiindex header in Pandas

I have a data frame in form of a time series looking like this:
and a second table with additional information to the according column(names) like this:
Now, I want to combine the two, adding specific information from the second table into the header of the first one. With a result like this:
I have the feeling the solution to this is quite trivial, but somehow I just cannot get my head around it. Any help/suggestions/hints on how to approach this?
MWE to create the two tables:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],columns=['a', 'b', 'c'])
df2 = pd.DataFrame([['a','b','c'],['a_desc','b_desc','c_desc'],['a_unit','b_unit','c_unit']]).T
df2.columns=['MSR','OBJDESC','UNIT']
You could get a metadata dict for each of the original column names and then update the original df
# store the column metadata you want in the header here
header_metadata = {}
# loop through your second df
for i, row in df2.iterrows():
    # get the column name that this corresponds to
    column_name = row.pop('MSR')
    # drop the `SCALE` metadata if the table has such a column
    # (the MWE above does not)
    row = row.drop('SCALE', errors='ignore')
    # we will want to add the data in dict(row) to our first df
    header_metadata[column_name] = dict(row)
# rename the columns of your first df
df.columns = [
    '\n'.join((c, *header_metadata[c].values()))
    for c in df.columns
]
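Since the question title asks for a MultiIndex header, here is a minimal sketch that builds one with pd.MultiIndex.from_arrays instead of joining strings with newlines. It assumes every column of df has exactly one matching MSR entry in df2; the variable name meta is only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['a', 'b', 'c'])
df2 = pd.DataFrame([['a', 'b', 'c'],
                    ['a_desc', 'b_desc', 'c_desc'],
                    ['a_unit', 'b_unit', 'c_unit']]).T
df2.columns = ['MSR', 'OBJDESC', 'UNIT']

# Reorder the metadata rows so they line up with the columns of df.
meta = df2.set_index('MSR').loc[list(df.columns)]

# One header level per metadata field.
df.columns = pd.MultiIndex.from_arrays(
    [meta.index, meta['OBJDESC'], meta['UNIT']],
    names=['MSR', 'OBJDESC', 'UNIT'],
)
print(df)
```

A column can then be selected with its full tuple, for example df[('a', 'a_desc', 'a_unit')].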

Get Non Empty Column in Pandas Dataframe

In the following pandas dataframe there are missing values in different columns for each row.
import pandas as pd
import numpy as np
d = {'col1': [1, 2, None], 'col2': [None, 4, 5], 'col3': [3, None, None]}
df = pd.DataFrame(data=d)
df
I know I can use this to locate which columns are not empty in the ith row
df.iloc[0].notnull()
And then something like the following to find which specific columns are not empty.
np.where(df.iloc[0].notnull())
However, how can I then use those values as indices to return the non missing columns in the ith row?
For example, in the 0th row I'd like to return back columns
df.iloc[0, [0,2]]
This isn't quite right, but I'm guessing is somewhere along these lines?
df.iloc[0, np.where(df.iloc[0].notnull())]
** Edit
I realize I can do this
df.iloc[0, np.where(df.iloc[0].notnull())[0].tolist()]
And this returns the expected result. However, is this the most efficient approach?
Here's a way using np.isnan
# set row number
row_number = 0
# get dataframe
df.loc[row_number, ~np.isnan(df.values)[row_number]]
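An alternative that stays within pandas is Series.dropna, which removes the missing entries of a single row and keeps the column labels. A minimal sketch using the dataframe from the question (the name non_missing is an assumption for illustration):

```python
import pandas as pd

d = {'col1': [1, 2, None], 'col2': [None, 4, 5], 'col3': [3, None, None]}
df = pd.DataFrame(data=d)

row_number = 0
# Select the row as a Series, then drop its missing values.
non_missing = df.iloc[row_number].dropna()

print(list(non_missing.index))  # ['col1', 'col3']
print(non_missing.tolist())     # [1.0, 3.0]
```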

How to multiply all rows of a Pandas dataframe by a single row in another Pandas dataframe?

I have this Dataframe in python
and I want to multiply every row in the first dataframe by this single row in the dataframe below, as a vector
Some things I have tried from googling: df.mul, df.apply. But they seem to multiply the two frames together normally instead of performing a vectorized operation
Example data:
df = pd.DataFrame({'x':[1,2,3], 'y':[1,2,3]})
v1 = pd.DataFrame({'x':[2], 'y':[3]})
Multiply DataFrame with row:
df.multiply(np.array(v1), axis='columns')
If the use case needs accurate matching of columns
Example:
df = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])
coeffs_df = pd.DataFrame([[10, 9]], columns=['y', 'x'])
You need to convert the single-row dataframe (coeffs_df) to a Series first, then perform the multiplication:
df.multiply(coeffs_df.iloc[0], axis='columns')
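To make the difference concrete, the sketch below compares label-aligned multiplication (via a Series) with positional multiplication (via a plain NumPy array) on the example data above. The names aligned and positional are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])
coeffs_df = pd.DataFrame([[10, 9]], columns=['y', 'x'])

# Series: values are matched by column label (x * 9, y * 10).
aligned = df.multiply(coeffs_df.iloc[0], axis='columns')

# NumPy array: labels are lost, values are matched by position (x * 10, y * 9).
positional = df.multiply(coeffs_df.to_numpy()[0], axis='columns')

print(aligned)
print(positional)
```

With the Series, column x is multiplied by 9 and column y by 10 even though coeffs_df lists y first.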

Pandas DataFrame to List of Lists

It's easy to turn a list of lists into a pandas dataframe:
import pandas as pd
df = pd.DataFrame([[1,2,3],[3,4,5]])
But how do I turn df back into a list of lists?
lol = df.what_to_do_now?
print lol
# [[1,2,3],[3,4,5]]
You could access the underlying array and call its tolist method:
>>> df = pd.DataFrame([[1,2,3],[3,4,5]])
>>> lol = df.values.tolist()
>>> lol
[[1L, 2L, 3L], [3L, 4L, 5L]]
If the data has column and index labels that you want to preserve, there are a few options.
Example data:
>>> df = pd.DataFrame([[1,2,3],[3,4,5]], \
columns=('first', 'second', 'third'), \
index=('alpha', 'beta'))
>>> df
       first  second  third
alpha      1       2      3
beta       3       4      5
The tolist() method described in other answers is useful but yields only the core data - which may not be enough, depending on your needs.
>>> df.values.tolist()
[[1, 2, 3], [3, 4, 5]]
One approach is to convert the DataFrame to json using df.to_json() and then parse it again. This is cumbersome but does have some advantages, because the to_json() method has some useful options.
>>> df.to_json()
{
"first":{"alpha":1,"beta":3},
"second":{"alpha":2,"beta":4},"third":{"alpha":3,"beta":5}
}
>>> df.to_json(orient='split')
{
"columns":["first","second","third"],
"index":["alpha","beta"],
"data":[[1,2,3],[3,4,5]]
}
Cumbersome but may be useful.
The good news is that it's pretty straightforward to build lists for the columns and rows:
>>> columns = [df.index.name] + [i for i in df.columns]
>>> rows = [[i for i in row] for row in df.itertuples()]
This yields:
>>> print(f"columns: {columns}\nrows: {rows}")
columns: [None, 'first', 'second', 'third']
rows: [['alpha', 1, 2, 3], ['beta', 3, 4, 5]]
If the None as the name of the index is bothersome, rename it:
df = df.rename_axis('stage')
Then:
>>> columns = [df.index.name] + [i for i in df.columns]
>>> print(f"columns: {columns}\nrows: {rows}")
columns: ['stage', 'first', 'second', 'third']
rows: [['alpha', 1, 2, 3], ['beta', 3, 4, 5]]
I wanted to preserve the index, so I adapted the original answer to this solution:
list_df = df.reset_index().values.tolist()
Now you can paste it somewhere else (e.g. into a Stack Overflow question) and later recreate it:
df = pd.DataFrame(list_df, columns=['name1', ...])
df.set_index(['name1'], inplace=True)
I don't know if it will fit your needs, but you can also do:
>>> lol = df.values
>>> lol
array([[1, 2, 3],
[3, 4, 5]])
This is just a NumPy array (a numpy.ndarray), which lets you do all the usual NumPy array things.
I had this problem: how do I get the headers of the df to be in row 0 for writing them to row 1 in the excel (using xlsxwriter)? None of the proposed solutions worked, but they pointed me in the right direction. I just needed one line more of code
# get csv data
df = pd.read_csv(filename)
# combine column headers and list of lists of values
lol = [df.columns.tolist()] + df.values.tolist()
Maybe something changed but this gave back a list of ndarrays which did what I needed.
list(df.values)
Not quite related to the issue, but here is another flavor with the same goal:
converting a dataframe Series into a list of lists to plot a chart using create_distplot in Plotly
hist_data=[]
hist_data.append(map_data['Population'].to_numpy().tolist())
"df.values" returns a numpy array. This does not preserve the data types. An integer might be converted to a float.
df.iterrows() yields each row as a Series, which also does not guarantee that the data types are preserved. See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html
The code below converts to a list of list and preserves the data types:
rows = [list(row) for row in df.itertuples()]
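The dtype issue can be seen on a small frame with one integer and one float column. This is a minimal sketch with made-up column names; it passes index=False to itertuples so that the index is left out of each row, which differs slightly from the line above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'i': [1, 2], 'f': [0.5, 1.5]})

# df.values upcasts mixed int/float columns to a single float array.
via_values = df.values.tolist()

# itertuples reads each column separately, so integers stay integers.
via_tuples = [list(row) for row in df.itertuples(index=False)]

print(via_values)  # [[1.0, 0.5], [2.0, 1.5]]
print(via_tuples)  # [[1, 0.5], [2, 1.5]]
```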
If you wish to convert a Pandas DataFrame to a table (list of lists) and include the header column this should work:
import pandas as pd
def dfToTable(df: pd.DataFrame) -> list:
    return [list(df.columns)] + df.values.tolist()
Usage (in REPL):
>>> df = pd.DataFrame(
[["r1c1","r1c2","r1c3"],["r2c1","r2c2","r3c3"]]
, columns=["c1", "c2", "c3"])
>>> df
c1 c2 c3
0 r1c1 r1c2 r1c3
1 r2c1 r2c2 r3c3
>>> dfToTable(df)
[['c1', 'c2', 'c3'], ['r1c1', 'r1c2', 'r1c3'], ['r2c1', 'r2c2', 'r3c3']]
The solutions presented so far suffer from a "reinventing the wheel" approach. Quoting @AMC:
If you're new to the library, consider double-checking whether the functionality you need is already offered by those Pandas objects.
If you convert a dataframe to a list of lists you will lose information - namely the index and columns names.
My solution: use to_dict()
dict_of_lists = df.to_dict(orient='split')
This will give you a dictionary with three lists: index, columns, data. If you decide you really don't need the columns and index names, you get the data with
dict_of_lists['data']
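A short sketch of what the 'split' orientation contains and how the dataframe can be rebuilt from it, using the labelled example frame from earlier on this page. The name rebuilt is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [3, 4, 5]],
                  columns=['first', 'second', 'third'],
                  index=['alpha', 'beta'])

dict_of_lists = df.to_dict(orient='split')
print(dict_of_lists['index'])    # ['alpha', 'beta']
print(dict_of_lists['columns'])  # ['first', 'second', 'third']
print(dict_of_lists['data'])     # [[1, 2, 3], [3, 4, 5]]

# The three lists are enough to reconstruct the original frame.
rebuilt = pd.DataFrame(dict_of_lists['data'],
                       index=dict_of_lists['index'],
                       columns=dict_of_lists['columns'])
```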
We can use the DataFrame.iterrows() function to iterate over each of the rows of the given Dataframe and construct a list out of the data of each row:
# Empty list
row_list = []
# Iterate over each row
for index, rows in df.iterrows():
    # Create list for the current row
    my_list = [rows.Date, rows.Event, rows.Cost]
    # append the list to the final list
    row_list.append(my_list)
# Print
print(row_list)
We can successfully extract each row of the given data frame into a list
This is very simple:
import numpy as np
# np.array(df) alone gives an ndarray; tolist() turns it into a list of lists
list_of_lists = np.array(df).tolist()
Note: I have seen many cases on Stack Overflow where converting a Pandas Series or DataFrame to a NumPy array or plain Python lists is entirely unnecessary. If you're new to the library, consider double-checking whether the functionality you need is already offered by those Pandas objects.
To quote a comment by @jpp:
In practice, there's often no need to convert the NumPy array into a list of lists.
If a Pandas DataFrame/Series won't work, you can use the built-in DataFrame.to_numpy and Series.to_numpy methods.
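For reference, a minimal sketch of the two to_numpy methods on the example data from this page; the variable names arr, col and as_lists are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [3, 4, 5]], columns=['first', 'second', 'third'])

arr = df.to_numpy()            # 2-D array holding the whole frame
col = df['second'].to_numpy()  # 1-D array holding one column

# Only convert to nested lists if a consumer really requires them.
as_lists = arr.tolist()
print(arr.shape, col.tolist(), as_lists)
```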
A function I wrote that allows including the index column or the header row:
def df_to_list_of_lists(df, index=False, header=False):
    rows = []
    if header:
        rows.append(([df.index.name] if index else []) + [e for e in df.columns])
    for row in df.itertuples():
        rows.append([e for e in row] if index else [e for e in row][1:])
    return rows
