Need to combine multiple rows based on index

I have a dataframe with values like:
     0    1    2
a    5  NaN    6
a  NaN    2  NaN
I need to combine the two rows, since both share the index 'a', and also add up the columns so the output is a single column.
The expected output is below. The value is 13 because 5 + 2 + 6 = 13.
    0
a  13
I tried this with the concat function but got errors.

How about using pandas DataFrame.sum()?
import pandas as pd
import numpy as np
# np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
data = pd.DataFrame({"0": [5, np.nan], "1": [np.nan, 2], "2": [6, np.nan]})
row_total = data.sum(axis=1, skipna=True)  # sum across the columns of each row
row_total.sum(axis=0)                      # then add the row totals together
result:
13.0
EDIT: @Chris's comment (which I did not see while writing my answer) shows how to do it in one line if all rows have the same index.
data:
data = pd.DataFrame({"0": [5, np.nan],
                     "1": [np.nan, 2],
                     "2": [6, np.nan]},
                    index=['a', 'a'])
gives:
0 1 2
a 5.0 NaN 6.0
a NaN 2.0 NaN
Then
data.groupby(data.index).sum().sum(axis=1)
Returns a Series with one total per index label:
a    13.0
dtype: float64
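To get the single-column layout shown in the question (a DataFrame with one column named 0), one possible variation is to sum each row first, group the row totals by index label, and convert the resulting Series to a frame. This is only a sketch; the integer column labels and the names row_totals and result are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Same data as above, with integer column labels as in the question
data = pd.DataFrame({0: [5, np.nan], 1: [np.nan, 2], 2: [6, np.nan]},
                    index=["a", "a"])

# Sum across columns for each row; NaN values are skipped by default
row_totals = data.sum(axis=1)

# Add up the row totals that share an index label, then name the column 0
result = row_totals.groupby(level=0).sum().to_frame(name=0)
print(result)
```

Unlike the plain double sum, this keeps one row per distinct index label, so it should also work when the frame holds more labels than just 'a'.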

Related

Merge two dataframes of different lengths with matching ID and fill NaN values of main dataframe on two columns

I have two dataframes, the main dataframe has two columns for Lat and Long some of which have values and some of which are NaN. I have another dataframe that is a subset of this main dataframe with Lat and Long filled in with values. I'd like to fill in the main DataFrame with these values based on matching ID.
Main DataFrame:
ID Lat Long
0 9547507704 33.853682 -80.369867
1 9777677704 32.942332 -80.066165
2 5791407702 47.636067 -122.302559
3 6223567700 34.224719 -117.372550
4 9662437702 42.521828 -82.913680
... ... ... ...
968552 4395967002 NaN NaN
968553 6985647108 NaN NaN
968554 7996438405 NaN NaN
968555 9054647103 NaN NaN
968556 9184687004 NaN NaN
DataFrame to fill:
ID Lat Long
0 2392497107 36.824257 -76.272486
1 2649457102 37.633918 -77.507746
2 2952437110 37.511077 -77.528711
3 3379937304 39.119430 -77.569008
4 3773127208 36.909731 -76.070420
... ... ... ...
23263 9512327001 37.371059 -79.194838
23264 9677417002 38.406665 -78.913133
23265 9715167306 38.761194 -77.454184
23266 9767568404 37.022287 -76.319882
23267 9872047407 38.823017 -77.057818
The two dataframes are of different lengths.
EDIT for clarification: I need to replace the NaN in the Lat & Long columns of the main DataFrame with the Lat & Long from the subset if ID matches in both DataFrames. My DataFrames are both >60 columns, I am only trying to replace the NaN for those two columns.
Edit:
I went with this mapping solution, although it isn't exactly what I'm looking for. I know there is a much simpler solution.
#mapping coordinates to NaN values in main
m = dict(zip(fill_df.ID,fill_df.Lat))
main_df.Lat = main_df.Lat.fillna(main_df.ID.map(m))
n = dict(zip(fill_df.ID,fill_df.Long))
main_df.Long = main_df.Long.fillna(main_df.ID.map(n))

new_df = pd.merge(main_df, sub_df, how='left', on='ID')
I guess the left join will do the job.
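Note that a left join on its own only brings the subset's coordinates in as extra columns; the NaN values in the original Lat and Long columns still have to be filled from them. A minimal sketch of that follow-up step is below. The toy frames, the "_fill" suffix, and the "Other" column are assumptions for illustration, and it presumes each ID appears at most once in sub_df (otherwise the join would duplicate rows).

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real frames (hypothetical values)
main_df = pd.DataFrame({"ID": [1, 2, 3],
                        "Lat": [4.0, np.nan, np.nan],
                        "Long": [7.0, np.nan, np.nan],
                        "Other": ["x", "y", "z"]})
sub_df = pd.DataFrame({"ID": [2, 3], "Lat": [5.0, 6.0], "Long": [8.0, 9.0]})

# Left join only the columns we need; the subset's copies get a "_fill" suffix
merged = main_df.merge(sub_df[["ID", "Lat", "Long"]], how="left", on="ID",
                       suffixes=("", "_fill"))

# Fill NaN in the main columns from the joined columns, then drop the helpers
for col in ("Lat", "Long"):
    merged[col] = merged[col].fillna(merged[col + "_fill"])
new_df = merged.drop(columns=["Lat_fill", "Long_fill"])
print(new_df)
```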

One approach is to use DataFrame.combine_first. This method aligns DataFrames on index and columns, so you need to set ID as the index of each DataFrame, call df_main.combine_first(df_filler), then reset ID back into a column. (Seems awkward; there's probably a more elegant approach.)
Assuming your main DataFrame is named df_main and your DataFrame to fill is named df_filler:
df_main.set_index('ID').combine_first(df_filler.set_index('ID')).reset_index()
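Because combine_first fills every shared column and also adds rows that exist only in the filler, it may help to restrict it to the two coordinate columns when the frames have many columns. The sketch below shows one way to do that; the toy data and the "Name" column are hypothetical, and it assumes ID values are unique within each frame.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real frames (hypothetical values)
df_main = pd.DataFrame({"ID": [3, 1, 2],
                        "Lat": [np.nan, 4.0, np.nan],
                        "Long": [np.nan, 7.0, 8.0],
                        "Name": ["c", "a", "b"]})
df_filler = pd.DataFrame({"ID": [2, 3],
                          "Lat": [5.0, 6.0],
                          "Long": [8.0, 9.0],
                          "Name": ["B", "C"]})

cols = ["Lat", "Long"]
main = df_main.set_index("ID")
filler = df_filler.set_index("ID")[cols]

# Fill only Lat/Long; reindex restores the original row order and
# discards any IDs that exist only in the filler
main[cols] = main[cols].combine_first(filler).reindex(main.index)
result = main.reset_index()
print(result)
```

All other columns, such as "Name" here, are left exactly as they were in the main frame.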

This should do the trick:
import math
import pandas as pd

A = pd.DataFrame({'ID': [1, 2, 3], 'Lat': [4, 5, 6], 'Long': [7, 8, float('nan')]})
B = pd.DataFrame({'ID': [2, 3], 'Lat': [5, 6], 'Long': [8, 9]})
print('Old table:')
print(A)
print('Fix table:')
print(B)
# Compare every row of A with every row of B and copy over missing values
for i in A.index.to_list():
    for j in B.index.to_list():
        if not A['ID'][i] == B['ID'][j]:
            continue  # IDs differ, nothing to fill from this row
        if math.isnan(A['Lat'][i]):
            A.at[i, 'Lat'] = B['Lat'][j]
        if math.isnan(A['Long'][i]):
            A.at[i, 'Long'] = B['Long'][j]
print('New table:')
print(A)
Returns:
Old table:
ID Lat Long
0 1 4 7.0
1 2 5 8.0
2 3 6 NaN
Fix table:
ID Lat Long
0 2 5 8
1 3 6 9
New table:
ID Lat Long
0 1 4 7.0
1 2 5 8.0
2 3 6 9.0
Not very elegant but gets the job done :)

How do I fill na values in a column with the average of previous non-na and next non-na value in pandas?

Raw table:
Column A
5
nan
nan
15
New table:
Column A
5
10
10
15

One option might be the following (forward-filling and back-filling the column, then averaging the two results):
import pandas as pd
import numpy as np
df = pd.DataFrame({'x': [np.nan, 5, np.nan, np.nan, 15]})
# Series.ffill()/bfill() replace fillna(method=...), which is deprecated in recent pandas
filled_series = [df['x'].ffill(), df['x'].bfill()]
print(pd.concat(filled_series, axis=1).mean(axis=1))
# 0 5.0
# 1 5.0
# 2 10.0
# 3 10.0
# 4 15.0
As you can see, this works even if nan happens at the beginning or at the end.

select range of values for all columns in pandas dataframe

I have a dataframe 'DF', part of which looks like this:
I want to select only the values between 0 and 0.01 to form a new dataframe (with blanks where the value was over 0.01).
To do this, I tried:
similarity = []
for x in DF:
    similarity.append([DF[DF.between(0, 0.01).any(axis=1)]])
simdf = pd.DataFrame(similarity)
simdf.to_csv("similarity.csv")
However, I get the error AttributeError: 'DataFrame' object has no attribute 'between'
How do I select a range of values and create a new dataframe with these?

Just do the two comparisons:
df_new = df[(df>0) & (df<0.01)]
Example:
import pandas as pd
df = pd.DataFrame({"a":[0,2,4,54,56,4],"b":[4,5,7,12,3,4]})
print(df[(df>5) & (df<33)])
a b
0 NaN NaN
1 NaN NaN
2 NaN 7.0
3 NaN 12.0
4 NaN NaN
5 NaN NaN
If you want a blank string instead of NaN:
df[(df>5) & (df<33)].fillna("")
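The error in the question arises because between is a Series method, not a DataFrame method. If the inclusive bounds of between are wanted, one option is to apply it column by column and use the resulting boolean frame as a mask. The following is a sketch with made-up data; the names in_range and simdf are only illustrative.

```python
import pandas as pd

# Hypothetical data standing in for the question's DF
df = pd.DataFrame({"a": [0.005, 0.5, 0.0], "b": [0.02, 0.01, 0.003]})

# Series.between is inclusive on both ends by default; apply it per column
in_range = df.apply(lambda col: col.between(0, 0.01))

# Keep values inside the range, NaN elsewhere
simdf = df.where(in_range)
print(simdf)
```

Calling simdf.fillna("") afterwards would turn the NaN cells into blank strings, as in the answer above.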

pandas: fill missing data in data frame columns

I have the following pandas data frame:
import numpy as np
import pandas as pd
timestamps = [1, 14, 30]
data = dict(quantities=[1, 4, 9], e_quantities=[1, 2, 3])
df = pd.DataFrame(data=data, columns=data.keys(), index=timestamps)
which looks like this:
quantities e_quantities
1 1 1
14 4 2
30 9 3
However, the timestamps should run from 1 to 52:
index = pd.RangeIndex(1, 53)
The following line provides the timestamps that are missing:
series_fill = pd.Series(np.nan, index=index.difference(df.index)).sort_index()
How can I get the quantities and e_quantities columns to have NaN values at these missing timestamps?
I've tried:
df = pd.concat([df, series_fill]).sort_index()
but it adds another column (0) and swaps the order of the original data frame:
0 e_quantities quantities
1 NaN 1.0 1.0
2 NaN NaN NaN
3 NaN NaN NaN
Thanks for any help here.

I think you are looking for reindex
df=df.reindex(index)
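Applied to the data from the question, the reindex call could look like the sketch below. It keeps the original column order and inserts NaN rows for the timestamps that were missing; the integer columns become float because NaN cannot be stored in an integer column by default.

```python
import pandas as pd

timestamps = [1, 14, 30]
data = dict(quantities=[1, 4, 9], e_quantities=[1, 2, 3])
df = pd.DataFrame(data=data, index=timestamps)

# Conform the frame to the full range of timestamps 1..52;
# labels not present in the original index get NaN in every column
index = pd.RangeIndex(1, 53)
df = df.reindex(index)
print(df.head())
```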

Pandas- set values to an empty dataframe

I have initialized an empty pandas dataframe that I am now trying to fill but I keep running into the same error. This is the (simplified) code I am using
import pandas as pd
cols = list("ABC")
df = pd.DataFrame(columns=cols)
# set the values for the first two rows
df.loc[0:2,:] = [[1,2],[3,4],[5,6]]
On running the above code I get the following error:
ValueError: cannot copy sequence with size 3 to array axis with dimension 0
I am not sure what's causing this. I tried the same thing with a single row at a time and it works (df.loc[0,:] = [1,2,3]). I thought this would be the logical extension for handling more than one row, but clearly I am wrong. What's the correct way to do this? I need to enter values for multiple rows and columns at once. I can do it using a loop, but that's not what I am looking for.
Any help would be great. Thanks.

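Since you already have the columns from the empty dataframe, use them in the DataFrame constructor, i.e.
import numpy as np
import pandas as pd
cols = list("ABC")
df = pd.DataFrame(columns=cols)
# Transpose so that each inner list becomes a column rather than a row
df = pd.DataFrame(np.array([[1,2],[3,4],[5,6]]).T, columns=df.columns)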
A B C
0 1 3 5
1 2 4 6
If you want to use loc specifically, reindex the dataframe first and then assign, i.e.
arr = np.array([[1,2],[3,4],[5,6]]).T
df = df.reindex(np.arange(arr.shape[0]))
df.loc[0:arr.shape[0],:] = arr
A B C
0 1 3 5
1 2 4 6

How about adding data by index, as below? You can call the function whenever you receive new data.
import pandas as pd

df = pd.DataFrame(columns=list("ABC"))

def add_to_df(index, data):
    # data holds one list per column; zip(*data) turns it into rows
    for idx, row in zip(index, zip(*data)):
        df.loc[idx] = row

# Set values for the first two rows
data1 = [[1, 2], [3, 4], [5, 6]]
index1 = [0, 1]
add_to_df(index1, data1)
print(df)
print()
# Set values for the next three rows
data2 = [[7, 8, 9], [10, 11, 12], [13, 14, 15]]
index2 = [2, 3, 4]
add_to_df(index2, data2)
print(df)
Result
>>>
A B C
0 1.0 3.0 5.0
1 2.0 4.0 6.0
A B C
0 1.0 3.0 5.0
1 2.0 4.0 6.0
2 7.0 10.0 13.0
3 8.0 11.0 14.0
4 9.0 12.0 15.0
>>>

From the documentation and some experiments, my guess is that loc only allows you to insert one key at a time. However, you can insert multiple keys first with reindex, as @Dark shows.
The .loc/[] operations can perform enlargement when setting a non-existent key for that axis.
http://pandas-docs.github.io/pandas-docs-travis/indexing.html#setting-with-enlargement
Also, by using loc[:2, :] you are asking to select the first rows of the dataframe. However, there is nothing in the empty df to select: there are no rows, while you are trying to insert 3 rows. Hence the message:
ValueError: cannot copy sequence with size 3 to array axis with dimension 0
BTW, [[1,2],[3,4],[5,6]] will be 3 rows rather than 2.
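The one-key-at-a-time behaviour described above can be sketched as follows. This is only an illustration of setting with enlargement; the rows list and the loop variable names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(columns=list("ABC"))

# Each inner list is one complete row (three values for columns A, B, C)
rows = [[1, 2, 3], [4, 5, 6]]

# Assigning to a label that does not exist yet enlarges the frame by one row
for label, row in enumerate(rows):
    df.loc[label] = row

print(df)
```

For more than a handful of rows, building the frame in one constructor call, as in the other answers, is likely to be faster than enlarging it row by row.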

Does this give the output you are looking for?
import pandas as pd
df=pd.DataFrame({'A':[1,2],'B':[3,4],'C':[5,6]})
Output :
A B C
0 1 3 5
1 2 4 6
