I have a dictionary where each key has two parts: the index (row) coordinate and the column coordinate. I would like to use this dictionary to populate a pandas DataFrame based on these coordinates.
For example, my dictionary looks like this:
final = {('BUV395', 'BUV496'): 0, ('BUV395', 'BUV563'): 0, ('BUV395', 'BUV615'): 0, ('BUV395', 'BUV661'): 0, etc...
The input to my function is the pandas DataFrame with the original data - just to give context to the code below:
def matrix_all_pairs(df):
    dataframe = pd.DataFrame(index=range(0, len(df.index.values)), columns=range(0, len(df.index.values)))
    dataframe.columns = df.index.values
    idx = list(df.index.values)
    list_fluor = list(combinations(df.index.values, 2))
    final = {}
    for fluor in list_fluor:
        if (r2_score(df.xs(fluor[0]), df.xs(fluor[1]))) < 0:
            final[fluor] = 0
        else:
            final[fluor] = (r2_score(df.xs(fluor[0]), df.xs(fluor[1])))
    for fluor, value in final.items():
        x = value
        dataframe.loc(idx.index(fluor[0]), fluor[1]) = x
    dataframe.index = df.index.values
    return dataframe
When I try to run this, it gives me "SyntaxError: can't assign to function call" for the line:
dataframe.loc(idx.index(fluor[0]), fluor[1]) = x
Is there a better way of doing this? I've seen multiple people say that populating an empty DataFrame in a loop is messy, but I'm not sure how else I could do it.
I'm not sure how to post my data for people to work with - I'm new to this site.
Is this what you're asking? The first item in each tuple is the "row/index" value and the second item is the "column" header. Essentially, you have a MultiIndex Series that you want to unstack into a single-index DataFrame.
df = pd.DataFrame.from_dict(final, orient='index')   # tuple keys become the index
df[['index','column']] = df.index.values.tolist()    # split the tuples into two columns
df = df.set_index(['index','column'])[0].unstack()   # pivot the second level to columns
Your example final dictionary has only one unique value among the first tuple elements, so the result would be:
column BUV496 BUV563 BUV615 BUV661
index
BUV395 0 0 0 0
An alternate example to show a more obviously two-dimensional DataFrame:
final = {('BUV395', 'BUV496'): 0, ('BUV395', 'BUV563'): 0, ('BUV496', 'BUV395'): 0, ('BUV496', 'BUV563'): 0, ('BUV563', 'BUV395'): 0, ('BUV563', 'BUV496'): 0}
df = pd.DataFrame.from_dict(final, orient='index')
df[['index','column']] = df.index.values.tolist()
df = df.set_index(['index','column'])[0].unstack().rename_axis(None).rename_axis(None, axis=1)
BUV395 BUV496 BUV563
BUV395 NaN 0.0 0.0
BUV496 0.0 NaN 0.0
BUV563 0.0 0.0 NaN
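Since the tuple keys already form a MultiIndex, the same reshape can be done more directly; this one-liner should produce the same frame (a shorter equivalent, assuming the same final dict as above):
df = pd.Series(final).unstack()  # tuple keys -> MultiIndex; unstack pivots the second level to columns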
I would like to iterate through a DataFrame's rows and concatenate selected rows to a different DataFrame, essentially building up a new DataFrame from some of the rows.
For example:
IPCSection and IPCClass DataFrames:
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis=0)
finalpatentclasses = pd.DataFrame(columns=allcolumns)
for isec, secrow in IPCSection.iterrows():
    for icl, clrow in IPCClass.iterrows():
        if secrow[0] in clrow[0]:
            pdList = [finalpatentclasses, pd.DataFrame(secrow), pd.DataFrame(clrow)]
            finalpatentclasses = pd.concat(pdList, axis=0, ignore_index=True)
display(finalpatentclasses)
The output has the values scattered across separate rows with NaN everywhere else. I want the NaN values to disappear and the data to move under the correct columns. I tried axis=1, but that messes up the column names. append does not work either; all values are placed diagonally in the table, again with NaN values.
Alright, I have figured it out. The idea is to create a one-row newrow DataFrame, concatenate the section and class values into a single array for that row, and then concat it onto the final DataFrame.
Here is the code:
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis=0)
finalpatentclasses = pd.DataFrame(columns=allcolumns)
for isec, secrow in IPCSection.iterrows():
    for icl, clrow in IPCClass.iterrows():
        if secrow[0] in clrow[0]:
            newrow = pd.DataFrame(columns=allcolumns)
            values = np.concatenate((secrow.values, clrow.values), axis=0)
            newrow.loc[len(newrow.index)] = values
            finalpatentclasses = pd.concat([finalpatentclasses, newrow], axis=0)
finalpatentclasses.reset_index(drop=True, inplace=True)
display(finalpatentclasses)
Update: the code below is more efficient, collecting plain lists and building the DataFrame once at the end:
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis=0)
newList = []
for secrow in IPCSection.itertuples():
    for clrow in IPCClass.itertuples():
        if secrow[1] in clrow[1]:
            values = [secrow[1], secrow[2], clrow[1], clrow[2]]
            newList.append(values)
finalpatentclasses = pd.DataFrame(newList, columns=allcolumns)
display(finalpatentclasses)
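For what it's worth, nested row loops like this can usually be replaced by a cross join plus a filter. A minimal sketch, assuming pandas >= 1.2 (for how='cross') and that the matching codes sit in the first column of each frame:
pairs = IPCSection.merge(IPCClass, how='cross', suffixes=('_sec', '_cl'))  # all row pairs
n_sec = len(IPCSection.columns)
# keep rows where the section code is contained in the class code
mask = [s in c for s, c in zip(pairs.iloc[:, 0], pairs.iloc[:, n_sec])]
finalpatentclasses = pairs[mask].reset_index(drop=True)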
I have a DataFrame with one column "Numbers" and I want to add a second column "Result". Each value should be the sum of the previous two values in the "Numbers" column, or NaN where there aren't two previous values.
import pandas as pd
import numpy as np
data = {
    "Numbers": [100, 200, 400, 0]
}
df = pd.DataFrame(data, index=["whatever1", "whatever2", "whatever3", "whatever4"])

def add_prev_two_elems_to_DF(df):
    numbers = "Numbers"  # alias
    result = "Result"  # alias
    df[result] = np.nan  # empty column
    result_index = list(df.columns).index(result)
    for i in range(len(df)):
        if i < 2:
            df.iloc[i, result_index] = np.nan
        else:
            df.iloc[i, result_index] = df.iloc[i - 1][numbers] + df.iloc[i - 2][numbers]

add_prev_two_elems_to_DF(df)
display(df)
The output is:
Numbers Result
whatever1 100 NaN
whatever2 200 NaN
whatever3 400 300.0
whatever4 0 600.0
But this looks quite complicated. Can it be done more simply, and maybe faster? I am not looking for a solution with sum(); I want a general solution for any kind of function that can fill a column using values from other rows.
Edit 1: I forgot to import numpy.
Edit 2: I changed one line to this:
if i < 2: df.iloc[i,result_index] = np.nan
Looks like you could use rolling.sum together with shift. Since rolling.sum sums up to and including the current row, we shift the result down one row so that each row gets the sum of the previous two rows:
df['Result'] = df['Numbers'].rolling(2).sum().shift()
Output:
Numbers Result
whatever1 100 NaN
whatever2 200 NaN
whatever3 400 300.0
whatever4 0 600.0
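If you want the pattern to generalize to arbitrary functions of earlier rows, plain shift also works: shift(k) gives the value k rows above, and the first rows become NaN automatically. A minimal sketch (my addition, same output as above):
df['Result'] = df['Numbers'].shift(1) + df['Numbers'].shift(2)  # previous row plus the one before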
This is the shortest code I could develop. It outputs exactly the same table.
import numpy as np
import pandas as pd
#import swifter # apply() gets swifter

data = {
    "Numbers": [100, 200, 400, 0]
}
df = pd.DataFrame(data, index=["whatever1", "whatever2", "whatever3", "whatever4"])

def func(a: pd.Series) -> float:  # we expect 3 elements, but we don't check that
    a = a.reset_index(drop=True)  # the index now starts with 0, 1, ...
    return a[0] + a[1]  # we use the first two elements; the 3rd is unnecessary

df["Result"] = df["Numbers"].rolling(3).apply(func)
#df["Result"] = df["Numbers"].swifter.rolling(3).apply(func)
display(df)
I'm trying to add rows and columns to pandas incrementally. I have a lot of data stored across multiple datastores and a heuristic to determine a value. As I navigate across this datastore, I'd like to be able to incrementally update a dataframe, where in some cases, either names or days will be missing.
def foo():
    df = pd.DataFrame()
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    for day in range(1, 4, 1):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            value = random.randrange(0, 20, 1)  # random value from heuristic
            col = '{}_{}'.format(year, day)  # column name
            df = df.append({col: value, 'name': name}, ignore_index=True)
    df.set_index('name', inplace=True, drop=True)
    print(df.loc['Bill'])
This produces the following results:
2016_1 2016_2 2016_3
name
Bill 15.0 NaN NaN
Bill NaN 12.0 NaN
I've created a heatmap of the data and it's blocky due to duplicate names, so the output I'm looking for is:
2016_1 2016_2 2016_3
name
Bill 15.0 12.0 NaN
How can I combine these rows?
Is there a more efficient means of creating this dataframe?
Try this:
df.groupby('name')[df.columns.values].sum()
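One caveat worth adding (not from the original answer): sum() turns an all-NaN group into 0, so Bill's 2016_3 entry would become 0.0 rather than stay NaN; passing min_count=1 preserves the NaN:
df.groupby('name').sum(min_count=1)  # a group needs at least one non-null value, else NaN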
Or try this:
df.pivot_table(index='name', aggfunc='sum', dropna=False)
After you run your foo() function, you can use any aggregation function (provided you have at most one value per column and all the others are null) together with groupby on df.
First, use reset_index to get back your name column.
Then use groupby and apply. Here I propose a custom function which checks that there is at most one value per column, and raises a ValueError if not.
df.reset_index(inplace=True)

def aggdata(x):
    if all([i <= 1 for i in x.count()]):
        return x.mean()
    else:
        raise ValueError

ddf = df.groupby('name').apply(aggdata)
If all the values in a column are null except one, x.mean() will return that value (in fact almost any aggregator works: with a single non-null value, that is the one returned).
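A shorter variant (my addition, not part of the answer above): first() already skips nulls and returns the first non-null value per column, which covers the same "at most one value per column" case without a custom function:
ddf = df.groupby('name').first()  # first non-null value per column in each group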
It would be easier to have the names as columns and the date as the index instead. Plus, you can work with plain lists inside the loop and create the pd.DataFrame afterwards.
e.g.
year = 2016
names = ['Bill', 'Bob', 'Ryan']
index = []
valueBill = []
valueBob = []
valueRyan = []
for day in range(1, 4):
    if random.choice([True, False]):  # sometimes a name will be missing
        valueBill.append(random.randrange(0, 20))
        valueBob.append(random.randrange(0, 90))
        valueRyan.append(random.randrange(0, 200))
        index.append('{}-0{}'.format(year, day))  # index label
    else:
        valueBill.append(np.nan)
        valueBob.append(np.nan)
        valueRyan.append(np.nan)
        index.append(np.nan)

df = pd.DataFrame({})
for name, value in zip(names, [valueBill, valueBob, valueRyan]):
    df[name] = value
df = df.set_index(pd.to_datetime(index))
You can append the entries whose names do not already exist, and then do an update for the existing entries.
import pandas as pd
import random

def foo():
    df = pd.DataFrame()
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    for day in range(1, 4, 1):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            value = random.randrange(0, 20, 1)  # random value from heuristic
            col = '{}_{}'.format(year, day)  # column name
            new_df = pd.DataFrame({col: value, 'name': name}, index=[1]).set_index('name')
            # append only the rows whose names are not yet in df...
            df = pd.concat([df, new_df[~new_df.index.isin(df.index)].dropna()])
            # ...then fill in the existing rows in place
            df.update(new_df)
    print(df)
I have an empty dataframe like the following:
simReal2013 = pd.DataFrame(index = np.arange(0,1,1))
Then I read some .csv files into DataFrames:
stat = np.arange(0, 5)
xv = [0.005, 0.01, 0.05]
br = [0.001, 0.005]
for i in xv:
    for j in br:
        I = 0
        for s in stat:
            string = 'results/2013/real/run_%d_%f_%f_15.0_10.0_T0_RealNet.csv' % (s, i, j)
            sim = pd.read_csv(string, sep=' ')
            I += np.array(sim.I)
        sim.I = I / 5
        col = '%f_%f' % (i, j)
        simReal2013.insert(0, col, sim)
I would like to add the dataframe that I read in a cell of simReal2013. In doing so I get the following error:
ValueError: Wrong number of items passed 9, placement implies 1
Yes, putting a dataframe inside of a dataframe is probably not the way you want to go, but if you must, this is one way to do it:
df_in = pd.DataFrame([[1, 2, 3]] * 2)
d = {}
d['a'] = df_in
df_out = pd.DataFrame([d])
type(df_out.loc[0, "a"])
>>> pandas.core.frame.DataFrame
Maybe a dictionary of dataframes would suffice for your use.
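Applied to the loop above, the dictionary version might look like this (a sketch reusing the question's names and file paths):
sims = {}  # one DataFrame per parameter pair, keyed by the '%f_%f' string
for i in xv:
    for j in br:
        I = 0
        for s in stat:
            string = 'results/2013/real/run_%d_%f_%f_15.0_10.0_T0_RealNet.csv' % (s, i, j)
            sim = pd.read_csv(string, sep=' ')
            I += np.array(sim.I)
        sim.I = I / 5
        sims['%f_%f' % (i, j)] = sim
Lookup is then just sims['0.005000_0.001000'], which returns the full DataFrame.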
I am using Pandas and want to add rows to an empty DataFrame with columns already established.
So far my code looks like this...
def addRows(cereals, lines):
    for i in np.arange(1, len(lines)):
        dt = parseLine(lines[i])
        dt = pd.Series(dt)
        print(dt)
        # YOUR CODE GOES HERE (add dt to cereals)
        cereals.append(dt, ignore_index=True)
    return cereals
However, when I run...
cereals = addRows(cereals,lines)
cereals
the dataframe returns with no rows, just the columns. I am not sure what I am doing wrong but I am pretty sure it has something to do with the append method. Anyone have any ideas as to what I am doing wrong?
There are two probable reasons your code is not operating as intended:
cereals.append(dt, ignore_index = True) is not doing what you think it is. You're trying to append a series, not a DataFrame there.
cereals.append(dt, ignore_index = True) does not modify cereals in place, so when you return it, you're returning the unchanged original. An equivalent function would look like this:
--
>>> def foo(a):
... a + 1
... return a
...
>>> foo(1)
1
I haven't tested this on my machine, but I think your fixed solution would look like this:
def addRows(cereals, lines):
    for i in np.arange(1, len(lines)):
        data = parseLine(lines[i])
        new_df = pd.DataFrame([data], columns=cereals.columns)  # wrap in a list: one row
        cereals = cereals.append(new_df, ignore_index=True)
    return cereals
By the way, I don't really know where lines is coming from, but right away I would at least modify it to look like this:
data = [parseLine(line) for line in lines]
cereals = cereals.append(pd.DataFrame(data, columns=cereals.columns), ignore_index=True)
How to add an extra row to a pandas dataframe
You could also create a new DataFrame and just append that DataFrame to your existing one. E.g.
>>> import pandas as pd
>>> empty_alph = pd.DataFrame(columns=['letter', 'index'])
>>> alph_abc = pd.DataFrame([['a', 0], ['b', 1], ['c', 2]], columns=['letter', 'index'])
>>> empty_alph.append(alph_abc)
letter index
0 a 0.0
1 b 1.0
2 c 2.0
As I noted in the link, you can also use the loc method on a DataFrame:
>>> df = empty_alph.append(alph_abc)
>>> df.loc[df.shape[0]] = ['d', 3]  # df.shape[0] just finds the next position in the index
>>> df
letter index
0 a 0.0
1 b 1.0
2 c 2.0
3 d 3.0
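(A note beyond the original answer: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the equivalent would be pd.concat.)
>>> df = pd.concat([empty_alph, alph_abc], ignore_index=True)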