I have a list "profiles"; each element of the list looks like the example in the screenshot from the original post.
I want to create a dataframe out of this list, so that every account is in the dataframe.
When I create one dataframe for every element of the list manually and then append them all to one dataframe it works. However, I wanted to do a loop because I have about 40 elements in that list.
The code looks as follows (doing it manually):
df2 = pd.DataFrame(profiles[1])
df1 = df1.append(df2)
df3 = pd.DataFrame(profiles[2])
df1 = df1.append(df3)
... and so on
My loop looks like this:
for y in range(len(profiles)):
    if x != y:
        df1 = pd.DataFrame(profiles[x])
        df2 = pd.DataFrame(profiles[y])
        df1.append(df2)
Does anybody know how to make the loop work? Because it does not append the second dataframe.
You are overwriting your df1 in each iteration of the loop.
IIUC, in the absence of reproducible data and an example, this is what you should add to your code.
Outside of the for loop, create an empty dataframe:
df_final = pd.DataFrame()
Inside the loop:
df_final = pd.concat([df_final, df1])
If your df1 and df2 both get initialized in each iteration, you can add them both to df_final together, as follows:
df_final = pd.concat([df_final, df1, df2])
At the end of the loop, df_final will have all the dfs appended to it.
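A more idiomatic sketch that avoids growing a dataframe inside the loop at all (assuming every element of profiles can be passed to pd.DataFrame directly):
import pandas as pd

# build one dataframe per profile, then concatenate once at the end
dfs = [pd.DataFrame(p) for p in profiles]
df_final = pd.concat(dfs, ignore_index=True)
This concatenates once instead of re-copying df_final on every iteration, which matters as the list grows past your ~40 elements.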
append is not an in-place operation, so you have to assign the result of the call (otherwise it gets lost).
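For example, either spelling captures the result (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is the durable one):
df1 = df1.append(df2)        # assign the result back
df1 = pd.concat([df1, df2])  # equivalent spelling that works on pandas 2.x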
I'm starting to lose my mind a bit. I have:
df = pd.DataFrame(bunch_of_stuff)
df2 = df.loc[bunch_of_conditions].copy()
def transform_df2(df2):
    df2['new_col'] = [rand()]*len(df2)
    df2['existing_column_1'] = [list of new values]
    return df2
df2 = transform_df2(df2)
I now want to re-insert df2 into df, such that it overwrites all of its previous records.
What would the best way to do this be? df.loc[df2.index] = df2 ? This doesn't bring over any of the new columns in df2 though.
You have the right method with pd.concat. However, you can optimize a little by using a boolean mask to avoid recomputing the index difference:
m = bunch_of_conditions
df2 = df[m].copy()
df = pd.concat([df[~m], df2]).sort_index()
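A minimal end-to-end sketch of that pattern, with toy data standing in for bunch_of_stuff and bunch_of_conditions:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['w', 'x', 'y', 'z']})
m = df['a'] > 2                  # stand-in for bunch_of_conditions
df2 = df[m].copy()
df2['new_col'] = 99              # the transform adds a new column
df = pd.concat([df[~m], df2]).sort_index()
Rows outside the mask end up with NaN in new_col, which is the expected outcome when the transform only applies to the selected records.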
Why do you want to make a copy of your dataframe? Isn't it simpler to use the dataframe itself?
One way I did it was:
df = pd.concat([df.loc[~df.index.isin(df2.index)], df2])
I want to replace elements of df2 with elements of df1 according to this rule: if the first-row, first-column cell of df2 has the value 1, then the first-row, first-column element of df1 goes there; if it is zero, the 0 stays. If the last-column element of any df2 row is 1, then that row's next df1 element goes into the last column, and so on.
So I want to replace every 1 in df2 with a df1 element, walking each row left to right. df3 is going to look like:
abcde0000;
abcd0e000;
abcd00e00;...
We can use the apply function for this, but first you have to concat both frames along axis 1. I am using a dummy table with just three rows; the approach works for any number of rows.
import pandas as pd
import numpy as np
# Dummy data
df1 = pd.DataFrame([['a','b','c','d','e'],['a','b','c','d','e'],['a','b','c','d','e']])
df2 = pd.DataFrame([[1,1,1,1,1,0,0,0,0],[1,1,1,1,0,1,0,0,0],[1,1,1,1,0,0,1,0,0]])
# Display dataframes. display() works in Jupyter notebooks; use print() in plain scripts.
display(df1)
display(df2)
# Concat DFs
df3 = pd.concat([df1,df2],axis=1)
display(df3)
# Define function for replacing
def replace(letters, indexes):
    seek = 0
    for i in range(len(indexes)):
        if indexes[i] == 1:
            indexes[i] = letters[seek]
            seek += 1
    return ''.join(list(map(str, indexes)))
# Applying replace function to dataframe
df4 = df3.apply(lambda x: replace(x[:5], x[5:]), axis=1)
# Display df4
display(df4)
The result is
0 abcde0000
1 abcd0e000
2 abcd00e00
dtype: object
I think this will solve your problem
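If you'd rather skip the intermediate concat, here is a minimal sketch of the same rule applied row by row, pairing df1 and df2 positionally (same dummy frames as above):
# walk one row of flags, consuming the next letter on each 1
def merge_row(letters, flags):
    it = iter(letters)
    return ''.join(str(next(it)) if f == 1 else str(f) for f in flags)

df4 = pd.Series(merge_row(df1.iloc[i], df2.iloc[i]) for i in range(len(df2)))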
Quite new to python for data analysis, still a noob.
I have 100+ pandas data frames whose variables are saved in a list.
I then have the variable names saved in another list as strings, to add into the dataframes as an identifier when plotting.
I have defined a function to prepare the tables for later feature engineering.
I want to iterate through each data frame and add the corresponding strings into a column called "Strings"
df = [df1, df2, df3]
strings = ['df1', 'df2', 'df3']
def mindex(df):
    # remove time index and insert Strings column
    df.reset_index(inplace=True)
    df.insert(1, "Strings", "")
    # iterate through each table adding the string values
    for item in enumerate(df):
        for item2 in strings:
            df['Strings'] = item2
# the loop to cycle through all the dataframes using the function above
for i in df:
    mindex(i)
Whenever I use the function above, it only fills the last value into all of the dataframes. I would like to note that all the dataframes are within the same date range; I have tried to use this as a way to stop the iteration, with no luck.
Can anyone point me in the right direction? Google has not been my friend so far.
df = [df1, df2, df3]
strings = ['df1', 'df2', 'df3']
for s, d in zip(strings, df):
    d['Strings'] = s
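A quick sanity check of that fix with hypothetical toy frames:
import pandas as pd

df1, df2 = pd.DataFrame({'v': [1]}), pd.DataFrame({'v': [2]})
for s, d in zip(['df1', 'df2'], [df1, df2]):
    d['Strings'] = s
print(df1['Strings'].iloc[0], df2['Strings'].iloc[0])  # df1 df2
Because zip pairs each frame with its own label, every assignment targets a different frame instead of repeatedly overwriting one column.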
In the line df['Strings'] = item2 you assign the variable item2 to the entire column df["Strings"].
So the first iteration assigns "df1", the second assigns "df2", and it ends with "df3", which is what you finally see.
If you want the Strings column entirely populated by "df1" for df1, "df2" for df2, etc., you have to:
def mindex(dfs: list, strings: list) -> list:
    final_dfs = []
    for single_df, df_name in zip(dfs, strings):
        single_df = single_df.copy()
        single_df.reset_index(inplace=True)
        single_df.insert(1, "Strings", "")
        single_df['Strings'] = df_name
        final_dfs.append(single_df)
    return final_dfs
dfs = [df1, df2, df3]
strings = ['df1', 'df2', 'df3']
result = mindex(dfs, strings)
A few takeaways:
If you define a list of dfs, name it dfs (plural), not df:
dfs = [df1, df2, df3]
If you iterate through a pandas DataFrame, use df.iterrows(). It yields indices and rows, so you don't need enumerate:
for idx, row in df.iterrows():
    ....
If a for-loop variable is not going to be used, like item in your example, replace it with an underscore. That is good practice for an unused variable:
for _ in enumerate(df):
    for item2 in strings:
        df['Strings'] = item2
Having a bit of trouble understanding the documentation
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
dfbreed['x'] = dfbreed.apply(testbreed, axis=1)
C:/Users/erasmuss/PycharmProjects/Sarah/farmdata.py:38: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Code is basically to re-arrange and clean some data to make analysis easier.
The data is given row-by-row per animal, but has repetitions, blanks, and some other sparse values.
The idea is basically to stack rows into columns and grab the useful data (Weight by date and final BCS) per animal.
(Screenshots in the original post show snippets of the initial dataframe and the desired output CSV.)
import pandas as pd
import numpy as np
#Function for cleaning up multiple entries of breeds
def testbreed(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
#Read Data
df1 = pd.read_csv("farmdata.csv")
#Drop empty rows
df1.dropna(how='all', axis=1, inplace=True)
#Copy to extract Weights in DF2
df2 = df1.copy()
df2 = df2.drop(['BCS', 'Breed','Age'], axis=1)
#Pivot for ID names in DF1
df1 = df1.pivot(index='ID', columns='Date', values=['Breed','Weight', 'BCS'])
#Pivot for weights in DF2
df2 = df2.pivot(index='ID', columns='Date', values = 'Weight')
#Split out Breeds and BCS into individual dataframes w/Duplicate/missing data for each ID
df3 = df1.copy()
dfbreed = df3[['Breed']]
dfBCS = df3[['BCS']]
#Drop empty BCS columns
df1.dropna(how='all', axis=1, inplace=True)
#Shorten Breed and BCS to single Column by grabbing first value that is real. see function above
dfbreed['x'] = dfbreed.apply(testbreed, axis=1)
dfBCS['x'] = dfBCS.apply(testbreed, axis=1)
#Populate BCS and Breed into new DF
df5= pd.DataFrame(data=None)
df5['Breed'] = dfbreed['x']
df5['BCS'] = dfBCS['x']
#Join Weights
df5 = df5.join(df2)
#Write output
df5.to_csv(r'.\out1.csv')
I want to take the BCS and Breed dataframes, which are column-multi-indexed by Breed or BCS and then by date, grab the first non-NaN value across each row of dates, and set that into a single column named Breed (or BCS).
I had a lot of trouble getting the columns to pick the first valid values in-situ on the DF.
I found a workaround in a 2015 answer:
2015 Answer
which defined the function at the top.
Reading through "setting a value on a copy of a slice" makes sense intuitively,
but I can't seem to think of a way to make it work as a direct replacement or index-based.
Should I be looping through?
Trying the second answer from here,
I get
dfbreed.loc[:, 'Breed'] = dfbreed['Breed'].apply(testbreed, axis=1)
dfBCS.loc[:, 'BCS'] = dfBCS['BCS'].apply(testbreed, axis=1)
which returns
ValueError: Must have equal len keys and value when setting with an iterable
I'm thinking this has something to do with the multi-index
keys come up as:
MultiIndex([('Breed', '1/28/2021'),
('Breed', '2/12/2021'),
('Breed', '2/4/2021'),
('Breed', '3/18/2021'),
('Breed', '7/30/2021')],
names=[None, 'Date'])
MultiIndex([('BCS', '1/28/2021'),
('BCS', '2/12/2021'),
('BCS', '2/4/2021'),
('BCS', '3/18/2021'),
('BCS', '7/30/2021')],
names=[None, 'Date'])
Sorry for the long question(s?)
Can anyone help me out?
Thanks.
You created dfbreed as:
dfbreed = df3[['Breed']]
So it is a view of the original DataFrame (limited to just this one column).
Remember that a view does not have its own data buffer; it is only a tool to "view"
a fragment of the original DataFrame, with read-only access.
When you attempt to perform dfbreed['x'] = dfbreed.apply(...), you are
actually attempting to violate that read-only access.
To avoid this error, create dfbreed as an "independent" DataFrame:
dfbreed = df3[['Breed']].copy()
Now dfbreed has its own data buffer and you are free to change the data.
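Applied to the code in the question, the fix is two explicit copies (a sketch; dfBCS needs the same treatment as dfbreed):
dfbreed = df3[['Breed']].copy()
dfBCS = df3[['BCS']].copy()
# both frames now own their data, so these assignments no longer
# trigger SettingWithCopyWarning
dfbreed['x'] = dfbreed.apply(testbreed, axis=1)
dfBCS['x'] = dfBCS.apply(testbreed, axis=1)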
I am iteratively processing a couple of "groups" and I would like to combine them into one dataframe, with every group identified by a 2nd-level index.
This:
print(pd.concat([df1, df2, df3], keys=["A", "B", "C"]))
was suggested to me - but it doesn't play well with iteration.
I am currently doing
data_all = pd.DataFrame([])
for a in a_list:
    group = some.function(a, etc)
    group = group.set_index(['CoI'], append=True, drop=True)
    group = group.reorder_levels(['CoI','oldindex'])
    data_all = pd.concat([data_all, group], ignore_index=False)
But the last line totally destroys my multi-index and I cannot reconstruct it.
Can you give me a hand?
You should be able to just make data_all a list and concatenate once at the end:
data_all = []
for a in a_list:
    group = some.function(a, etc)
    group = group.set_index(['CoI'], append=True, drop=True)
    group = group.reorder_levels(['CoI','oldindex'])
    data_all.append(group)
data_all = pd.concat(data_all, ignore_index=False)
Also keep in mind that pandas' concat works with iterators. Something like yield group may be more efficient than appending to a list each time. I haven't profiled it though!
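A sketch of that generator variant, keeping the placeholder some.function(a, etc) from the question:
def groups():
    for a in a_list:
        group = some.function(a, etc)
        group = group.set_index(['CoI'], append=True, drop=True)
        yield group.reorder_levels(['CoI', 'oldindex'])

data_all = pd.concat(groups(), ignore_index=False)
pd.concat accepts any iterable of frames, so the generator never has to materialize the whole list at once.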