Unexpected row getting changed in pandas loc assignment - python

I want to copy a portion of a pandas dataframe onto a different portion, overwriting the existing values there. I am using .loc but more rows are changing than the ones I am referencing.
My example:
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'D', 'E'],
    'col2': range(1, 6),
    'col3': range(6, 11)
})
print(df)
  col1  col2  col3
0    A     1     6
1    B     2     7
2    C     3     8
3    D     4     9
4    E     5    10
I want to write the values of col2 and col3 from the C and D rows onto the A and B rows. Using .loc:
df.loc[0:2, ["col2", "col3"]] = df.loc[2:4, ["col2", "col3"]].values
print(df)
  col1  col2  col3
0    A     3     8
1    B     4     9
2    C     5    10
3    D     4     9
4    E     5    10
This does what I want for rows A and B, but row C has also changed. I expect only the first two rows to change, i.e. my expected output is
  col1  col2  col3
0    A     3     8
1    B     4     9
2    C     3     8
3    D     4     9
4    E     5    10
Why did the C row also change, and how can I do this while changing only the first two rows?

Unlike Python list slicing, pandas.DataFrame.loc slicing is inclusive on both ends. The documentation warns:
Warning: Note that contrary to usual python slices, both the start and the stop are included
so you should do
df.loc[0:1, ["col2", "col3"]] = df.loc[2:3, ["col2", "col3"]].values
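The difference is easy to see directly: label-based .loc includes the stop label, while positional .iloc follows the usual Python convention.
print(len(df.loc[0:2]))   # 3 -- labels 0, 1 and 2 are all selected
print(len(df.iloc[0:2]))  # 2 -- position 2 is excluded
That is why the original assignment wrote three rows of values rather than two.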

In addition, you can pass explicit lists of labels; this way the rows need not be consecutive:
df.loc[[0,1], ["col2", "col3"]] = df.loc[[2,3], ["col2", "col3"]].values

You went too far with the indices:
df.loc[0:1, ["col2", "col3"]] = df.loc[2:3, ["col2", "col3"]].values


Sum column in one dataframe based on row value of another dataframe

Say, I have one data frame df:
   a  b   c  d     e
0  1  2  dd  5  Col1
1  2  3  ee  9  Col2
2  3  4  ff  1  Col4
There's another dataframe df2:
   Col1  Col2  Col3
0     1     2     4
1     2     3     5
2     3     4     6
I need to add a column Sum to the first dataframe, where each row sums the values of the df2 column whose name appears in column e of df1.
Expected output:
   a  b   c  d     e  Sum
0  1  2  dd  5  Col1    6
1  2  3  ee  9  Col2    9
2  3  4  ff  1  Col4    0
The Sum value in the last row is 0 because Col4 doesn't exist in df2.
What I tried: writing some lambdas and the apply function. Wasn't able to do it.
I'd greatly appreciate the help. Thank you.
Try map: df2.sum() is a Series indexed by column name, so mapping column e over it looks up each row's sum, and fillna(0) covers columns missing from df2:
df['Sum'] = df.e.map(df2.sum()).fillna(0)
df
Out[89]:
   a  b   c  d     e  Sum
0  1  2  dd  5  Col1  6.0
1  2  3  ee  9  Col2  9.0
2  3  4  ff  1  Col4  0.0
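For reference, this is the Series being mapped over (the column sums of the df2 above; Col4 is absent, hence the fillna):
print(df2.sum())
# Col1     6
# Col2     9
# Col3    15
# dtype: int64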
Try this. The following solution sums all values of the named column if it is present in df2, using the apply method, and returns 0 if no such column exists in df2.
df1.loc[:, "sum"] = df1.loc[:, "e"].apply(lambda x: df2.loc[:, x].sum() if x in df2.columns else 0)
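A quick check with the question's data, where row 2 of e holds 'Col4', a column absent from df2:
print(df1)
#    a  b   c  d     e  sum
# 0  1  2  dd  5  Col1    6
# 1  2  3  ee  9  Col2    9
# 2  3  4  ff  1  Col4    0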
Use .iterrows() to iterate through a dataframe, pulling out the values for each row as well as the index. A nested for-loop style of iteration can be used to grab the needed values from the second dataframe and apply them to the first:
import pandas as pd

df1 = pd.DataFrame(data={'a': [1, 2, 3], 'b': [2, 3, 4], 'c': ['dd', 'ee', 'ff'], 'd': [5, 9, 1], 'e': ['Col1', 'Col2', 'Col3']})
df2 = pd.DataFrame(data={'Col1': [1, 2, 3], 'Col2': [2, 3, 4], 'Col3': [4, 5, 6]})
df1['Sum'] = None
for index, value in df1.iterrows():
    total = 0
    for index2, value2 in df2.iterrows():
        total += value2[value['e']]  # pick the df2 column named in this row's 'e'
    df1.loc[index, 'Sum'] = total    # .loc avoids chained-assignment pitfalls
Output:
   a  b   c  d     e  Sum
0  1  2  dd  5  Col1    6
1  2  3  ee  9  Col2    9
2  3  4  ff  1  Col3   15
(Note this answer's sample data uses 'Col3' rather than the question's 'Col4', so all three lookups succeed.)

Create new dataframe from matching two dataframes' indices

I'm looking to create a new dataframe from data in two separate dataframes, effectively matching the index of each cell and writing each pair into a two-column dataframe. My real datasets have exactly the same number of rows and columns, FWIW. Example below:
DF1:
Col1  Col2  Col3
   1     2     3
   3     8     7
DF2:
Col1  Col2  Col3
   A     B     E
   R     S     W
Desired Dataframe:
Col1  Col2
   1     A
   2     B
   3     E
   3     R
   8     S
   7     W
Thank you for your help!
Here is your code. Note that ravel('F') would flatten column by column and give 1, 3, 2, 8, 3, 7; the desired output is row by row, so use the default C order:
df3 = pd.Series(df1.values.ravel())
df4 = pd.Series(df2.values.ravel())
df = pd.concat([df3, df4], axis=1)
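A quick illustration of the two flattening orders, using the values of DF1:
import numpy as np

a = np.array([[1, 2, 3], [3, 8, 7]])
print(a.ravel())     # [1 2 3 3 8 7] -- row-major (C order), matches the desired output
print(a.ravel('F'))  # [1 3 2 8 3 7] -- column-major (Fortran order)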
Use DataFrame.to_numpy and .flatten:
df = pd.DataFrame(
    {'Col1': df1.to_numpy().flatten(), 'Col2': df2.to_numpy().flatten()})
# print(df)
  Col1 Col2
0    1    A
1    2    B
2    3    E
3    3    R
4    8    S
5    7    W
You can do it easily like so:
list1 = df1.values.tolist()
list1 = [item for sublist in list1 for item in sublist]
list2 = df2.values.tolist()
list2 = [item for sublist in list2 for item in sublist]
df = {
    'Col1': list1,
    'Col2': list2
}
df = pd.DataFrame(df)
print(df)
Hope this helps :)
pd.concat(
    map(lambda x: x.unstack().sort_index(level=-1), (df1, df2)),
    axis=1
).reset_index(drop=True).rename(columns=['Col1', 'Col2'].__getitem__)
Result:
  Col1 Col2
0    1    A
1    2    B
2    3    E
3    3    R
4    8    S
5    7    W
Another way (alternative); stack yields unnamed columns 0 and 1, so name them with keys:
pd.concat((df1.stack(), df2.stack()), axis=1, keys=['Col1', 'Col2']).reset_index(drop=True)
or:
d = {'Col1': df1, 'Col2': df2}
pd.concat((v.stack() for k, v in d.items()), axis=1, keys=d.keys()).reset_index(drop=True)
# or: pd.concat(d.values(), keys=d.keys()).stack().unstack(0).reset_index(drop=True)
  Col1 Col2
0    1    A
1    2    B
2    3    E
3    3    R
4    8    S
5    7    W

apply function only within the same row index?

I have a dataframe with a sorted two-level index, and I want to apply diff to the column only within each col1 group, in the order sorted by col2.
import pandas as pd

mini_df = pd.DataFrame({'col1': ['A', 'B', 'C', 'A'], 'col2': [1, 2, 3, 4], 'col3': [1, 4, 7, 3]})
mini_df = mini_df.set_index(['col1', 'col2']).sort_index()
mini_df['diff'] = mini_df.col3.diff(1)
This gives me
           col3  diff
col1 col2
A    1        1   NaN
     4        3   2.0
B    2        4   1.0
C    3        7   3.0
Above, it applies diff row by row.
What I want is
           col3  diff
col1 col2
A    1        1   NaN
     4        3   2.0
B    2        4   NaN
C    3        7   NaN
You'll want to use groupby to apply diff to each group:
mini_df = pd.DataFrame({'col1': ['A', 'B', 'C', 'A'], 'col2': [1,2,3,4], 'col3': [1,4,7,3]})
mini_df = mini_df.set_index(['col1', 'col2']).sort_index()
mini_df['diff'] = mini_df.groupby(level='col1')['col3'].diff()
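A quick check with the mini_df above:
print(mini_df)
#            col3  diff
# col1 col2
# A    1        1   NaN
#      4        3   2.0
# B    2        4   NaN
# C    3        7   NaN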
Since you already go through the heavy lifting of the sort, you can diff and only assign within the group. You can't shift non-datetime indices, so either make a Series or use np.roll, though the latter wraps around and would give the wrong answer for a single-group DataFrame.
import pandas as pd
s = pd.Series(mini_df.index.get_level_values('col1'))
mini_df['diff'] = mini_df.col3.diff().where(s.eq(s.shift(1)).values)
           col3  diff
col1 col2
A    1        1   NaN
     4        3   2.0
B    2        4   NaN
C    3        7   NaN
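To see why the where mask keeps only within-group differences, inspect it directly (col1 in sorted order is A, A, B, C):
s = pd.Series(mini_df.index.get_level_values('col1'))  # ['A', 'A', 'B', 'C'] after the sort
print(s.eq(s.shift(1)).values)  # [False  True False False] -- True only where the previous row shares col1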

Deleting half of dataframe rows which meet condition

I'm looking to extract a subset of a dataframe based on a condition. Let's say
df = pd.DataFrame({'Col1': values1, 'Col2': values2, 'Col3': values3})
I'd like to sort by Col2. Of the entries in Col2 that are negative (if any), I'd like to drop the most negative half. So if values2 = [-5,10,13,-3,-1,-2], then I'd want to drop the rows corresponding to the values -5 and -3.
If I wanted to simply drop half the entire dataframe after sorting, I (think) could do
df = df.iloc[df.shape[0] // 2:]
Not sure how to introduce the conditionality of dropping half of only the negative values. The vast majority of my experience is in numpy - still getting used to thinking in terms of dataframes. Thanks in advance.
Data input
values1 = [-5,10,13,-3,-1,-2]
values2 = [-5,10,13,-3,-1,-2]
values3 = [-5,10,13,-3,-1,-2]
df = pd.DataFrame({'Col1': values1, 'Col2': values2, 'Col3': values3})
Use sample and concat. You can calculate the n to pass to sample(n); I simply use 2 here:
pd.concat([df[df.Col2 > 0], df[df.Col2 < 0].sample(2)])
Out[224]:
   Col1  Col2  Col3
1    10    10    10
2    13    13    13
5    -2    -2    -2
4    -1    -1    -1
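A sketch of deriving n from the data instead of hard-coding it (>= 0 keeps zeros too; note that sample retains a random half of the negatives, so to keep the least negative half deterministically you could swap sample for nlargest on Col2):
neg = df[df.Col2 < 0]
n_keep = len(neg) - len(neg) // 2  # drop half the negative rows, rounding the dropped count down
result = pd.concat([df[df.Col2 >= 0], neg.sample(n_keep)])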
A straightforward approach, using its own sample values. First, you want your dataframe sorted:
In [16]: df = pd.DataFrame({'Col1': values1, 'Col2': values2, 'Col3': values3})

In [17]: df
Out[17]:
   Col1  Col2 Col3
0     1    -5    a
1     2    10    b
2     3    13    c
3     4    -3    d
4     5    -1    e
5     6    -2    f
In [18]: df.sort_values('Col2', inplace=True)
In [19]: df
Out[19]:
   Col1  Col2 Col3
0     1    -5    a
3     4    -3    d
5     6    -2    f
4     5    -1    e
1     2    10    b
2     3    13    c
Then, create a boolean mask for the negative values, use np.where to get their positions, cut that index array in half, and drop those rows:
In [20]: mask = (df.Col2 < 0)
In [21]: idx, = np.where(mask)
In [22]: df.drop(df.index[idx[:len(idx)//2]])
Out[22]:
   Col1  Col2 Col3
5     6    -2    f
4     5    -1    e
1     2    10    b
2     3    13    c

CSV missing columns with pandas DataFrame

I want to read a csv as dataframe into Pandas.
My csv file has the following format
a b c d
0 1 2 3 4 5
1 2 3 4 5 6
When I read the csv with Pandas I get the following dataframe
     a  b  c  d
0 1  2  3  4  5
1 2  3  4  5  6
When I execute print df.columns
I get something like :
Index([u'a', u'b', u'c', u'd'], dtype='object')
And when I execute print df.iloc[0]
I get:
a    2
b    3
c    4
d    5
Name: (0, 1)
I would like to have a dataframe like
a  b  c  d  col1  col2
0  1  2  3     4     5
1  2  3  4     5     6
I don't know how many columns I will have to add, but I need as many columns as the number of values in the first line after the header. How can I achieve that?
One way to do this would be to read the data twice: once with the first row (the original column names) skipped, and once with only the column names read (and all the rows skipped). The file path below is a placeholder for your csv:
df = pd.read_csv('data.csv', header=None, skiprows=1)
columns = pd.read_csv('data.csv', nrows=0).columns.tolist()
columns
Output
['a', 'b', 'c', 'd']
Now find the number of missing columns and use a list comprehension to make the new column names:
num_missing_cols = len(df.columns) - len(columns)
new_cols = ['col' + str(i+1) for i in range(num_missing_cols)]
df.columns = columns + new_cols
df
   a  b  c  d  col1  col2
0  0  1  2  3     4     5
1  1  2  3  4     5     6
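Putting it together as a runnable sketch; io.StringIO stands in for the file here, and sep=' ' is an assumption based on the sample shown:
import io
import pandas as pd

csv_text = 'a b c d\n0 1 2 3 4 5\n1 2 3 4 5 6\n'

df = pd.read_csv(io.StringIO(csv_text), sep=' ', header=None, skiprows=1)
columns = pd.read_csv(io.StringIO(csv_text), sep=' ', nrows=0).columns.tolist()

num_missing_cols = len(df.columns) - len(columns)
df.columns = columns + ['col' + str(i + 1) for i in range(num_missing_cols)]
print(df)
#    a  b  c  d  col1  col2
# 0  0  1  2  3     4     5
# 1  1  2  3  4     5     6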
