I have initialized an empty pandas DataFrame that I am now trying to fill, but I keep running into the same error. This is the (simplified) code I am using:
import pandas as pd
cols = list("ABC")
df = pd.DataFrame(columns=cols)
# set the values for the first two rows
df.loc[0:2,:] = [[1,2],[3,4],[5,6]]
On running the above code I get the following error:
ValueError: cannot copy sequence with size 3 to array axis with dimension 0
I am not sure what's causing this. I tried the same thing one row at a time and it works (df.loc[0,:] = [1,2,3]). I thought this would be the logical extension when I want to handle more than one row, but clearly I am wrong. What's the correct way to do this? I need to enter values for multiple rows and columns at once. I can do it using a loop, but that's not what I am looking for.
Any help would be great. Thanks
Since you already have the columns from the empty dataframe, use them in the DataFrame constructor, i.e.
import pandas as pd
import numpy as np

cols = list("ABC")
df = pd.DataFrame(columns=cols)
# transpose so the three sublists become the columns A, B and C
df = pd.DataFrame(np.array([[1,2],[3,4],[5,6]]).T, columns=df.columns)
A B C
0 1 3 5
1 2 4 6
Well, if you want to use loc specifically, then reindex the dataframe first and then assign, i.e.
arr = np.array([[1,2],[3,4],[5,6]]).T
df = df.reindex(np.arange(arr.shape[0]))
df.loc[0:arr.shape[0],:] = arr
A B C
0 1 3 5
1 2 4 6
How about adding data by index, as below? You can call this externally as and when you receive data.
def add_to_df(index, data):
    # data arrives column-wise, so zip(*data) yields one tuple per row
    for idx, i in zip(index, zip(*data)):
        df.loc[idx] = i
#Set values for first two rows
data1 = [[1,2],[3,4],[5,6]]
index1 = [0,1]
add_to_df(index1, data1)
print(df)
print("")
#Set values for next three rows
data2 = [[7,8,9],[10,11,12],[13,14,15]]
index2 = [2,3,4]
add_to_df(index2, data2)
print(df)
Result
A B C
0 1.0 3.0 5.0
1 2.0 4.0 6.0
A B C
0 1.0 3.0 5.0
1 2.0 4.0 6.0
2 7.0 10.0 13.0
3 8.0 11.0 14.0
4 9.0 12.0 15.0
Looking through the documentation and some experiments, my guess is that loc only allows you to insert one key at a time. However, you can insert multiple keys first with reindex, as @Dark shows.
The .loc/[] operations can perform enlargement when setting a non-existent key for that axis.
http://pandas-docs.github.io/pandas-docs-travis/indexing.html#setting-with-enlargement
Also, when you use loc[0:2, :], you mean you want to select the first two rows. However, there is nothing in the empty df for you to select: there are no rows yet, while you are trying to insert 3 rows. Thus the message says
ValueError: cannot copy sequence with size 3 to array axis with dimension 0
BTW, [[1,2],[3,4],[5,6]] will be 3 rows rather than 2.
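To illustrate the enlargement behaviour, here is a minimal sketch (reusing the empty df from the question) of setting one non-existent key at a time, which loc allows without any reindex step:
import pandas as pd

df = pd.DataFrame(columns=list("ABC"))
# setting a key that does not exist yet enlarges the frame by one row
df.loc[0, :] = [1, 2, 3]
df.loc[1, :] = [4, 5, 6]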
Does this get the output you're looking for:
import pandas as pd
df=pd.DataFrame({'A':[1,2],'B':[3,4],'C':[5,6]})
Output :
A B C
0 1 3 5
1 2 4 6
Related
I have two datasets with identical column headers, and I would like to remove ALL data that is 100% identical, keeping only what they do not have exactly in common. How could I go about doing that?
Thank you for your time!
To get everything BUT the intersection of two pandas datasets, try this:
# Everything from the first except what is on second
r1 = df1[~df1.isin(df2)]
# Everything from the second except what is on first
r2 = df2[~df2.isin(df1)]
# concatenate and drop NANs
result = pd.concat(
[r1, r2]
).dropna().reset_index(drop=True)
There is one caveat, though: when filtering with boolean masks, your int values might turn into floats. By default, pandas replaces unwanted (False) values with the float NaN and converts the entire column to float. You can see this happening in the example below.
To circumvent this, explicitly declare the datatype when creating the dataframe.
Example
import pandas as pd
df1 = pd.read_csv("./csv1.csv") #, dtype='Int64')
print(f"csv1\n{df1}\n")
df2 = pd.read_csv("./csv2.csv") #, dtype='Int64')
print(f"csv2\n{df2}\n")
# Everything from first except what is on second
r1 = df1[~df1.isin(df2)]
# Everything from second except what is on first
r2 = df2[~df2.isin(df1)]
# concatenate and drop NANs
result = pd.concat(
[r1, r2]
).dropna().reset_index(drop=True)
print(f"result\n{result}\n")
Input
csv1
A B C
0 1 2 3
1 4 5 6
2 7 8 9
csv2
A B C
0 1 2 3
1 4 5 6
2 10 11 12
Output
result
A B C
0 7.0 8.0 9.0
1 10.0 11.0 12.0
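As a hedged variant of the example above, passing the nullable integer dtype when reading the CSVs should keep the columns as integers after masking (masked-out values become <NA> instead of the float NaN, and dropna still removes them):
# same filenames as above; 'Int64' is pandas' nullable integer dtype
df1 = pd.read_csv("./csv1.csv", dtype='Int64')
df2 = pd.read_csv("./csv2.csv", dtype='Int64')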
I have found examples of how to remove a column based on all values or a threshold, but I have not been able to find a solution to my particular problem, which is dropping a column if its last row is NaN. The reason for this is that I am using time series data in which the collection of data doesn't all start at the same time, which is fine, but if I used one of the previous solutions it would remove 95% of the dataset. However, I do not want data whose most recent value is NaN, as that means it is defunct.
A B C
nan t x
1 2 3
x y z
4 nan 6
Returns
A C
nan x
1 3
x z
4 6
You can also do something like this
df.loc[:, ~df.iloc[-1].isna()]
A C
0 NaN x
1 1 3
2 x z
3 4 6
Try with dropna
df = df.dropna(axis=1, subset=[df.index[-1]], how='any')
Out[8]:
A C
0 NaN x
1 1 3
2 x z
3 4 6
You can use .iloc, .loc and .notna() to sort out your problem.
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [np.nan, 1, "x", 4],
                   "B": ["t", 2, "y", np.nan],
                   "C": ["x", 3, "z", 6]})
# keep only the columns whose last value is not NaN
df = df.loc[:, df.iloc[-1, :].notna()]
You can use a boolean Series to select the columns to drop:
df.drop(df.columns[df.iloc[-1].isna()], axis=1)
Out:
A C
0 NaN x
1 1 3
2 x z
3 4 6
for col in temp_df.columns:
    if temp_df[col].iloc[-1] == 'nan':
        temp_df = temp_df.drop(col, axis=1)
This will work for you.
Basically what I'm doing here is looping over all column labels and checking whether the last entry is the string 'nan', then dropping that column.
temp_df.columns
gives the column labels to loop over (this avoids positions shifting as columns are dropped).
temp_df.drop(col, axis=1)
col is the column label and axis=1 says that you want to drop a column.
EDIT:
I read the other answers on this same post and it seems to me that notna would be best (I would use it), but the advantage of this method is that someone can compare anything they wish to.
Another method I found is pandas' isnull() function, which works like this:
for col in temp_df.columns:
    if pd.isnull(temp_df[col].iloc[-1]):
        temp_df = temp_df.drop(col, axis=1)
I have a pandas DataFrame with a multi-index like this:
import pandas as pd
import numpy as np
arr = [1]*3 + [2]*3
arr2 = list(range(3)) + list(range(3))
mux = pd.MultiIndex.from_arrays([
arr,
arr2
], names=['one', 'two'])
df = pd.DataFrame({'a': np.arange(len(mux))}, mux)
df
a
one two
1 0 0
1 1 1
1 2 2
2 0 3
2 1 4
2 2 5
I have a function that takes a slice of a DataFrame and needs to assign a new column to the rows that have been sliced:
def work(df):
    b = df.copy()
    # do some work on the slice and create values for a new column of the slice
    b['b'] = b['a'] * 2
    # assign the new values back to the slice in a new column
    df['b'] = b['b']
#pass in a slice of the df with only records that have the last value for 'two'
work(df.loc[df.index.isin(df.index.get_level_values('two')[-1:], level=1)])
However calling the function results in the error:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
How can I create a new column 'b' in the original DataFrame and assign its values for only the rows that were passed to the function, leaving the rest of the rows nan?
The desired output is:
a b
one two
1 0 0 nan
1 1 1 nan
1 2 2 4
2 0 3 nan
2 1 4 nan
2 2 5 10
NOTE: In the work function I'm actually doing a bunch of complex operations involving calling other functions to generate the values for the new column so I don't think this will work. Multiplying by 2 in my example is just for illustrative purposes.
You actually don't have an error, but just a warning. Try this:
def work(df):
    b = df.copy()
    # do some work on the slice and create values for a new column of the slice
    b['b'] = b['a'] * 2
    # assign the new values back to the slice in a new column
    df['b'] = b['b']
    return df
#pass in a slice of the df with only records that have the last value for 'two'
new_df = work(df.loc[df.index.isin(df.index.get_level_values('two')[-1:], level=1)])
Then:
df.reset_index().merge(new_df, how="left").set_index(["one","two"])
Output:
a b
one two
1 0 0 NaN
1 1 NaN
2 2 4.0
2 0 3 NaN
1 4 NaN
2 5 10.0
I don't think you need a separate function at all. Try this...
df['b'] = df['a'].where(df.index.isin(df.index.get_level_values('two')[-1:], level=1))*2
The Series.where() call on df['a'] here returns a series whose values are NaN for the rows that do not match your query.
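Equivalently, and closer to what the warning message itself suggests, here is a sketch of the same assignment done through .loc with the boolean mask (assuming the same df as above):
# rows whose 'two' level equals its last value
mask = df.index.isin(df.index.get_level_values('two')[-1:], level=1)
# creates column 'b' and leaves the unmasked rows as NaN
df.loc[mask, 'b'] = df.loc[mask, 'a'] * 2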
I have two sources of data: one is an old version and one is the current version of the same data.
I need to find the new, updated, and deleted rows between these two datasets.
Here is an example. "Updated" means the value of any column has changed compared to the old data.
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.DataFrame({'id':[1,2,3,4],'b':[4,np.nan,6,12]})
>>> df2 = pd.DataFrame({'id':[2,1,3,5],'b':[np.nan,40,6,6]})
>>> df1
id b
0 1 4.0
1 2 NaN
2 3 6.0
3 4 12.0
>>> df2
id b
0 2 NaN
1 1 40.0
2 3 6.0
3 5 6.0
Here id is the primary key for the table.
I can easily find new rows by comparing the primary key.
>>> df2[~df2.id.isin(df1.id)]
id b
3 5 6.0
But I am having trouble finding the updated rows in the new data source.
I tried the following:
>>> tmp = df1.merge(df2)
>>> df2[(~df2.id.isin(tmp.id)) & (df2.id.isin(df1.id))]
id b
1 1 40.0
This works for the given case. But when I apply the same thing to my original data frame (shape (97000, 58), where two columns combined make the PK), it does not give the desired result: it returns rows that are not updated.
My question is: is this the right way to achieve this?
How can I improve it?
Get the intersection of the ids and simply compare using ==. This is only possible because you have identically-labeled data frames (i.e. same indexes - due to the intersection - and same columns).
ids = set(df1.id.unique()).intersection(df2.id)
d1 = df1[df1.id.isin(ids)].set_index('id').sort_index()
d2 = df2[df2.id.isin(ids)].set_index('id').sort_index()
comp = (d1 == d2) | (pd.isnull(d1) & pd.isnull(d2))
which gives a boolean data frame with True values wherever values are equal, and False values wherever they differ
       b
id
1  False
2   True
3   True
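Building on comp, a small sketch (the names updated_ids and updated_rows are just illustrative) of how you could then pull out the updated rows from the new data:
updated_ids = comp.index[~comp.all(axis=1)]   # ids where any column differs
updated_rows = df2[df2.id.isin(updated_ids)]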
I tried to insert a new row into a dataframe named 'my_df1' using my_df1.loc, but in the result the new row added has NaN values.
my_data = {'A':pd.Series([1,2,3]),'B':pd.Series([4,5,6]),'C':('a','b','c')}
my_df1 = pd.DataFrame(my_data)
print(my_df1)
my_df1.loc[3] = pd.Series([5,5,5])
Result displayed is as below
A B C
0 1.0 4.0 a
1 2.0 5.0 b
2 3.0 6.0 c
3 NaN NaN NaN
The reason it is all NaN is that my_df1.loc[3] has the index (A, B, C) while pd.Series([5,5,5]) has the index (0, 1, 2). When you do series1 = series2, pandas aligns on the index and only copies values for the common labels, hence the result.
To fix this, do as @anky_91 says, or if you already have a series, use its values:
my_df1.loc[3] = my_series.values
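For illustration, a minimal sketch of the alignment fix: give the Series the same labels as the dataframe's columns (using my_df1 from the question):
my_df1.loc[3] = pd.Series([5, 5, 5], index=my_df1.columns)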
Finally I found out how to add a Series as a row or column to a dataframe
my_data = {'A':pd.Series([1,2,3]),'B':pd.Series([4,5,6]),'C':('a','b','c')}
my_df1 = pd.DataFrame(my_data)
print(my_df1)
Code 1 adds a new column 'D' with values 5, 5, 5 to the dataframe
my_df1.loc[:,'D'] = pd.Series([5,5,5],index = my_df1.index)
print(my_df1)
Code 2 adds a new row with index 3 and values 3, 4, 3, 4 to the dataframe from Code 1
my_df1.loc[3] = pd.Series([3,4,3,4],index = ('A','B','C','D'))
print(my_df1)