I have several dataframes in a list, obtained after using np.array_split, and I want to concat some of them into a single dataframe. In this example, I want to concat the 3 dataframes contained in b (all but the 2nd one, which is the element b[1] in the list):
df = pd.DataFrame({'country': ['a', 'b', 'c', 'd'],
                   'gdp': [1, 2, 3, 4],
                   'iso': ['x', 'y', 'z', 'w']})
a = np.array_split(df,4)
i = 1
b = a[:i]+a[i+1:]
desired_final_df = pd.DataFrame({'country': ['a', 'c', 'd'],
                                 'gdp': [1, 3, 4],
                                 'iso': ['x', 'z', 'w']})
I have tried to create an empty df and then use append in a loop over the elements of b, without complete success:
CV = pd.DataFrame()
CV = [CV.append[(b[i])] for i in b] #try1
CV = [CV.append(b[i]) for i in b] #try2
CV = pd.DataFrame([CV.append[(b[i])] for i in b]) #try3
for i in b:
    CV.append(b) #try4
I have reached a solution that works, but it is not efficient:
CV = pd.DataFrame()
CV = [CV.append(b) for i in b][0]
In this case, CV ends up holding the same full dataframe three times, and I just keep the first copy. However, in my real case with big datasets, building the same dataframe three times would cost far more computation time.
How could I do that without repeating operations?
According to the docs, DataFrame.append does not work in place the way list.append does; the resulting DataFrame object is returned instead. Catching that object should be enough for what you need:
df = pd.DataFrame()
for next_df in list_of_dfs:
    df = df.append(next_df)
You may want to pass the keyword argument ignore_index=True in the append call so that the resulting index is continuous, instead of restarting from 0 for each appended DataFrame (assuming the index of each DataFrame in the list starts from 0).
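For example, a minimal sketch of the same loop with ignore_index (note that append was deprecated in pandas 1.4 and removed in 2.0; the concat answer below is the modern equivalent):
df = pd.DataFrame()
for next_df in list_of_dfs:
    df = df.append(next_df, ignore_index=True)  # re-assign: append returns a new frame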
To concatenate multiple DFs, resetting the index, use pandas.concat:
pd.concat(b, ignore_index=True)
Output:
country gdp iso
0 a 1 x
1 c 3 z
2 d 4 w
Related
I have 4 different dataframes containing time series data that all have the same structure.
My goal is to take each individual dataframe and pass it through a function I have defined that will group them by datestamp, sum the columns and return a new dataframe with the columns I want. So in total I want 4 new dataframes that have only the data I want.
I just looked through this post:
Loop through different dataframes and perform actions using a function
but applying this did not change my results.
Here is my code:
I am putting the dataframes in a list so I can iterate through them:
dfs = [vds, vds2, vds3, vds4]
This is the function I want to pass each dataframe through:
def VDS_pre(df):
    df = df.groupby(['datestamp','timestamp']).sum().reset_index()
    df = df.rename(columns={'datestamp': 'Date', 'timestamp': 'Time', 'det_vol': 'VolumeVDS'})
    df = df[['Date','Time','VolumeVDS']]
    return df
This is the loop I made to iterate through my dataframe list and pass each one through my function:
for df in dfs:
    df = VDS_pre(df)
However once I go through my loop and go to print out the dataframes, they have not been modified and look like they initially did. Thanks for the help!
However once I go through my loop and go to print out the dataframes, they have not been modified and look like they initially did.
Yes, this is actually the case. The reason why they have not been modified is:
Assignment to the loop variable in a for item in lst: loop has no effect on either lst itself or the variables from which the list items got their values, as the following code demonstrates:
v1=1; v2=2; v3=3
lst = [v1,v2,v3]
for item in lst:
    item = 0
print(lst, v1, v2, v3) # gives: [1, 2, 3] 1 2 3
To achieve the result you expect to obtain you can use a list comprehension and the list unpacking feature of Python:
vds,vds2,vds3,vds4=[VDS_pre(df) for df in [vds,vds2,vds3,vds4]]
or the following code, which uses a list of strings holding the identifier/variable names of the dataframes:
sdfs = ['vds', 'vds2', 'vds3', 'vds4']
for sdf in sdfs:
    exec(f'{sdf} = VDS_pre(eval(sdf))')
Now printing vds, vds2, vds3 and vds4 will output the modified dataframes.
Pandas frame operations return a new copy of the data. Your snippet stores the result in the loop variable df, which is neither kept nor written back to your initial list. This is why you have no stored result after execution.
If you don't need to keep original frames, you may simply overwrite them:
for i, df in enumerate(dfs):
    dfs[i] = VDS_pre(df)
If you do need to keep them, use a second list and append each result to it:
l = []
for df in dfs:
    df2 = VDS_pre(df)
    l.append(df2)
Or, even better, use a list comprehension to collapse this snippet into a single line of code.
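For example, the loop above reduces to:
l = [VDS_pre(df) for df in dfs]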
Now you are able to store the result of your processing.
Additionally, if your frames share the same structure and can be merged into a single frame, you may consider concatenating them first and then applying your function to the result. That would be the idiomatic pandas approach.
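For instance, a minimal sketch of that idea (note it yields one combined result rather than four separate frames):
combined = VDS_pre(pd.concat(dfs, ignore_index=True))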
Working with Python Pandas 0.19.1.
I'm calling a function in loops, which returns a numeric list with length of 4 each time. What's the easiest way of concatenating them into a DataFrame?
I'm doing this:
result = pd.DataFrame()
for t in dates:
    result_t = do_some_stuff(t)
    result.append(result_t, ignore_index=True)
The problem is that it concatenates along the column instead of by rows. If dates has a length of 250, it gives a single-column df with 1000 rows. Instead, what I want is a 250 x 4 df.
I think you need to append all the results to the list result and then use concat. Since each do_some_stuff(t) returns a plain list, wrap it in a Series first:
result = []
for t in dates:
    result.append(pd.Series(do_some_stuff(t)))  # wrap each length-4 list in a Series
print(pd.concat(result, axis=1).T)  # transpose: 250 rows x 4 columns
I have a specific series of datasets which come in the following general form:
import pandas as pd
import random
df = pd.DataFrame({'n': random.sample(xrange(1000), 3),
                   't0': ['a', 'b', 'c'], 't1': ['d', 'e', 'f'],
                   't2': ['g', 'h', 'i'], 't3': ['i', 'j', 'k']})
The number of tn columns (t0, t1, t2 ... tn) varies depending on the dataset, but is always <30.
My aim is to merge the content of the tn columns for each row so that I achieve this result (note that for readability I need to keep the whitespace between elements):
df['result'] = df.t0 +' '+df.t1+' '+df.t2+' '+ df.t3
So far so good. This code may be simple, but it becomes clumsy and inflexible as soon as I receive another dataset where the number of tn columns goes up. This is where my question comes in:
Is there any other syntax to merge the content across multiple columns? Something agnostic to the number of columns, akin to:
df['result'] = ' '.join(df.ix[:,1:])
Basically, I want to achieve the same as the OP in the link below, but with whitespace between the strings:
Concatenate row-wise across specific columns of dataframe
The key to operating on columns (Series) of strings en masse is the Series.str accessor.
I can think of two .str methods to do what you want.
str.cat()
The first is str.cat. You have to start from a series, but you can pass a list of series (unfortunately you can't pass a dataframe) to concatenate with an optional separator. Using your example:
column_names = df.columns[1:] # skipping the first, numeric, column
series_list = [df[c] for c in column_names]
# concatenate:
df['result'] = series_list[0].str.cat(series_list[1:], sep=' ')
Or, in one line:
df['result'] = df[df.columns[1]].str.cat([df[c] for c in df.columns[2:]], sep=' ')
str.join()
The second is the .str.join() method, which works like the standard Python str.join(), but needs a column (Series) of iterables, for example a column of tuples, which we can get by applying tuple row-wise to a sub-dataframe of the columns you're interested in:
tuple_series = df[column_names].apply(tuple, axis=1)
df['result'] = tuple_series.str.join(' ')
Or, in one line:
df['result'] = df[df.columns[1:]].apply(tuple, axis=1).str.join(' ')
BTW, don't try the above with list instead of tuple. As of pandas-0.20.1, if the function passed into the DataFrame.apply() method returns a list and that list has the same number of entries as the columns of the original (sub)dataframe, DataFrame.apply() returns a DataFrame instead of a Series.
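A quick sketch of that difference (behaviour as described above for pandas 0.20.x; later versions changed how list results are handled):
sub = df[column_names]
print(type(sub.apply(tuple, axis=1)))  # Series of tuples
print(type(sub.apply(list, axis=1)))   # DataFrame in pandas 0.20.x when lengths match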
Other than using apply to concatenate the strings, you can also use agg to do so.
df[df.columns[1:]].agg(' '.join, axis=1)
Out[118]:
0 a d g i
1 b e h j
2 c f i k
dtype: object
Here is a slightly alternative solution:
In [57]: df['result'] = df.filter(regex=r'^t').apply(lambda x: x.add(' ')).sum(axis=1).str.strip()
In [58]: df
Out[58]:
n t0 t1 t2 t3 result
0 92 a d g i a d g i
1 916 b e h j b e h j
2 363 c f i k c f i k
I have this code using Pandas in Python:
all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')
prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.iteritems()})
returns = prices.pct_change()
I know I can run a regression like this:
regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()
but how can I do this for each column in the dataframe? Specifically, how can I iterate over columns, in order to run the regression on each?
Specifically, I want to regress each other ticker symbol (FIUIX, FSAIX and FSAVX) on FSTMX, and store the residuals for each regression.
I've tried various versions of the following, but nothing I've tried gives the desired result:
resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k], returns.FSTMX).fit()
    resids[k] = reg.resid
Is there something wrong with the returns[k] part of the code? How can I use the k value to access a column? Or else is there a simpler approach?
You can iterate over the column names directly:
for column in df:
    print(df[column])
You can use iteritems():
for name, values in df.iteritems():
    print('{name}: {value}'.format(name=name, value=values[0]))
This answer is for iterating over selected columns as well as all columns in a DF.
df.columns gives a list containing all the column names in the DF. That isn't very helpful if you want to iterate over all the columns, but it comes in handy when you want to iterate over columns of your choosing only.
We can use Python's list slicing to slice df.columns according to our needs. For example, to iterate over all columns but the first one, we can do:
for column in df.columns[1:]:
    print(df[column])
Similarly to iterate over all the columns in reversed order, we can do:
for column in df.columns[::-1]:
    print(df[column])
We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:
for ind, column in enumerate(df.columns):
    print(ind, column)
You can index dataframe columns by position using ix.
df1.ix[:, 0]
This returns the first column, for example (1 would be the second column).
df1.ix[0, :]
This returns the first row.
df1.ix[0, 1]
This is the value at the intersection of row 0 and column 1, and so on. So you can enumerate() returns.keys() and use the number to index the dataframe.
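For example, a sketch of that idea with the OP's returns frame (ix is deprecated in later pandas; iloc is the positional equivalent):
for i, name in enumerate(returns.keys()):
    column = returns.ix[:, i]  # the i-th column as a Series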
A workaround is to transpose the DataFrame and iterate over the rows.
for column_name, column in df.transpose().iterrows():
    print column_name
Using a list comprehension, you can get all the column names (header):
[column for column in df]
Based on the accepted answer, if an index corresponding to each column is also desired:
for i, column in enumerate(df):
    print i, df[column]
Each df[column] above is of type Series, which can simply be converted into a numpy ndarray:
for i, column in enumerate(df):
    print i, np.asarray(df[column])
I'm a bit late, but here's how I did this. The steps:
1. Create a list of all columns
2. Use itertools to take combinations of the columns
3. Append each fit's R squared value to a results dataframe, along with the list of excluded columns
4. Sort the results DF in descending order of R squared to see which is the best fit.
This is the code I used on a DataFrame called aft_tmt. Feel free to extrapolate to your use case.
import pandas as pd
# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
import statsmodels.formula.api as smf
import itertools
# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print itercols
len(itercols)
# results DF
regression_res = pd.DataFrame(columns = ["Rsq", "predictors", "excluded"])
# excluded cols
exc = []
# change 9 to the number of columns you want to combine from N columns.
#Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    m = smf.ols(formula = "sc ~ " + lmstr, data = aft_tmt)
    f = m.fit()
    # columns left out of this combination
    exc = [item for item in itercols if item not in x]
    regression_res = regression_res.append(pd.DataFrame([[f.rsquared, lmstr, "+".join(exc)]], columns = ["Rsq", "predictors", "excluded"]))
regression_res.sort_values(by="Rsq", ascending = False)
I landed on this question as I was looking for a clean iterator of columns only (Series, no names).
Unless I am mistaken, there is no such thing, which, if true, is a bit annoying. In particular, one would sometimes like to assign a few individual columns (Series) to variables, e.g.:
x, y = df[['x', 'y']] # does not work
There is df.items() that gets close, but it gives an iterator of tuples (column_name, column_series). Interestingly, there is a corresponding df.keys() which returns df.columns, i.e. the column names as an Index, so a, b = df[['x', 'y']].keys() assigns properly a='x' and b='y'. But there is no corresponding df.values(), and for good reason, as df.values is a property and returns the underlying numpy array.
One (inelegant) way is to do:
x, y = (v for _, v in df[['x', 'y']].items())
but it's less pythonic than I'd like.
Most of these answers are going via the column name, rather than iterating the columns directly. They will also have issues if there are multiple columns with the same name. If you want to iterate the columns, I'd suggest:
for series in (df.iloc[:,i] for i in range(df.shape[1])):
...
Assuming an X-factor / y-label (multicolumn) split:
columns = [c for c in _df.columns if c in ['col1', 'col2', 'col3']]  # or '..c not in..'
_df.set_index(columns, inplace=True)
print(_df.index)
X, y = _df.iloc[:, :4].values, _df.index.values
I'm starting from the pandas DataFrame documentation here: Introduction to data structures
I'd like to iteratively fill the DataFrame with values in a time series kind of calculation. I'd like to initialize the DataFrame with columns A, B, and timestamp rows, all 0 or all NaN.
I'd then add initial values and go over this data, calculating each new row from the row before, say row[A][t] = row[A][t-1] + 1 or so.
I'm currently using the code as below, but I feel it's kind of ugly and there must be a way to do this with a DataFrame directly or just a better way in general.
Note: I'm using Python 2.7.
import datetime as dt
import pandas as pd
import scipy as s
if __name__ == '__main__':
    base = dt.datetime.today().date()
    dates = [base - dt.timedelta(days=x) for x in range(0, 10)]
    dates.sort()

    valdict = {}
    symbols = ['A', 'B', 'C']
    for symb in symbols:
        valdict[symb] = pd.Series(s.zeros(len(dates)), dates)

    for thedate in dates:
        if thedate > dates[0]:
            for symb in valdict:
                valdict[symb][thedate] = 1 + valdict[symb][thedate - dt.timedelta(days=1)]

    print valdict
NEVER grow a DataFrame row-wise!
TLDR; (just read the bold text)
Most answers here will tell you how to create an empty DataFrame and fill it out, but no one will tell you that it is a bad thing to do.
Here is my advice: Accumulate data in a list, not a DataFrame.
Use a list to collect your data, then initialise a DataFrame when you are ready. Either a list-of-lists or list-of-dicts format will work, pd.DataFrame accepts both.
data = []
for row in some_function_that_yields_data():
    data.append(row)
df = pd.DataFrame(data)
pd.DataFrame converts the list of rows (where each row is a scalar value) into a DataFrame. If your function yields DataFrames instead, call pd.concat.
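For instance, a minimal sketch of the concat variant (some_function_that_yields_dataframes is a hypothetical generator of frames):
pieces = []
for piece in some_function_that_yields_dataframes():
    pieces.append(piece)
df = pd.concat(pieces, ignore_index=True)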
Pros of this approach:
It is always cheaper to append to a list and create a DataFrame in one go than it is to create an empty DataFrame (or one of NaNs) and append to it over and over again.
Lists also take up less memory and are a much lighter data structure to work with, append, and remove (if needed).
dtypes are automatically inferred (rather than assigning object to all of them).
A RangeIndex is automatically created for your data, instead of you having to take care to assign the correct index to the row you are appending at each iteration.
If you aren't convinced yet, this is also mentioned in the documentation:
Iteratively appending rows to a DataFrame can be more computationally
intensive than a single concatenate. A better solution is to append
those rows to a list and then concatenate the list with the original
DataFrame all at once.
*** Update for pandas >= 1.4: append is now DEPRECATED! ***
As of pandas 1.4, append has now been deprecated! Use pd.concat instead. See the release notes
These options are horrible
append or concat inside a loop
Here is the biggest mistake I've seen from beginners:
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': a, 'B': b, 'C': c}, ignore_index=True) # yuck
    # or similarly,
    # df = pd.concat([df, pd.DataFrame([{'A': a, 'B': b, 'C': c}])], ignore_index=True)
Memory is re-allocated for every append or concat operation you have. Couple this with a loop and you have a quadratic complexity operation.
The other mistake associated with df.append is that users tend to forget append is not an in-place function, so the result must be assigned back. You also have to worry about the dtypes:
df = pd.DataFrame(columns=['A', 'B', 'C'])
df = df.append({'A': 1, 'B': 12.3, 'C': 'xyz'}, ignore_index=True)
df.dtypes
A object # yuck!
B float64
C object
dtype: object
Dealing with object columns is never a good thing, because pandas cannot vectorize operations on those columns. You will need to do this to fix it:
df.infer_objects().dtypes
A int64
B float64
C object
dtype: object
loc inside a loop
I have also seen loc used to append to a DataFrame that was created empty:
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]
As before, you have not pre-allocated the amount of memory you need each time, so the memory is re-grown each time you create a new row. It's just as bad as append, and even more ugly.
Empty DataFrame of NaNs
And then, there's creating a DataFrame of NaNs, and all the caveats associated therewith.
df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
df
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
It creates a DataFrame of object columns, like the others.
df.dtypes
A object # you DON'T want this
B object
C object
dtype: object
Appending still has all the issues as the methods above.
for i, (a, b, c) in enumerate(some_function_that_yields_data()):
    df.iloc[i] = [a, b, c]
The Proof is in the Pudding
Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.
Benchmarking code for reference.
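Not the linked benchmark, but a minimal timing sketch along the same lines (numbers will vary by machine):
import timeit
import pandas as pd

def grow_append(n=1000):
    df = pd.DataFrame(columns=['A', 'B', 'C'])
    for i in range(n):
        df.loc[len(df)] = [i, i, i]  # memory re-grown on every iteration
    return df

def accumulate_list(n=1000):
    data = [[i, i, i] for i in range(n)]  # cheap list appends
    return pd.DataFrame(data, columns=['A', 'B', 'C'])

print(timeit.timeit(grow_append, number=3))      # slow, scales quadratically
print(timeit.timeit(accumulate_list, number=3))  # far faster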
Here's a couple of suggestions:
Use date_range for the index:
import datetime
import pandas as pd
import numpy as np
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')
columns = ['A','B', 'C']
Note: we could create an empty DataFrame (with NaNs) simply by writing:
df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0) # With 0s rather than NaNs
To do these type of calculations for the data, use a NumPy array:
data = np.array([np.arange(10)]*3).T
Hence we can create the DataFrame:
In [10]: df = pd.DataFrame(data, index=index, columns=columns)
In [11]: df
Out[11]:
A B C
2012-11-29 0 0 0
2012-11-30 1 1 1
2012-12-01 2 2 2
2012-12-02 3 3 3
2012-12-03 4 4 4
2012-12-04 5 5 5
2012-12-05 6 6 6
2012-12-06 7 7 7
2012-12-07 8 8 8
2012-12-08 9 9 9
If you simply want to create an empty data frame and fill it with some incoming data frames later, try this:
newDF = pd.DataFrame() #creates a new dataframe that's empty
newDF = newDF.append(oldDF, ignore_index = True) # ignoring index is optional
# try printing some data from newDF
print newDF.head() #again optional
In this example I am using this pandas doc to create a new data frame and then using append to write to the newDF with data from oldDF.
If I have to keep appending new data into this newDF from more than one oldDF, I just use a for loop to iterate over pandas.DataFrame.append(), as sketched below.
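A minimal sketch of that loop (old_dfs here is a hypothetical list of frames to merge in):
for oldDF in old_dfs:
    newDF = newDF.append(oldDF, ignore_index=True)  # re-assign each time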
Note: append() is deprecated since version 1.4.0. Use concat() instead.
Initialize empty frame with column names
import pandas as pd
col_names = ['A', 'B', 'C']
my_df = pd.DataFrame(columns = col_names)
my_df
Add a new record to a frame
my_df.loc[len(my_df)] = [2, 4, 5]
You also might want to pass a dictionary:
my_dic = {'A':2, 'B':4, 'C':5}
my_df.loc[len(my_df)] = my_dic
Append another frame to your existing frame
col_names = ['A', 'B', 'C']
my_df2 = pd.DataFrame(columns = col_names)
my_df = my_df.append(my_df2)
Performance considerations
If you are adding rows inside a loop, consider performance. For roughly the first 1000 records my_df.loc performs better, but it gradually becomes slower as the number of records in the loop grows.
If you plan to do this inside a big loop (say 10M records or so), you are better off using a mixture of the two: fill a temporary dataframe with loc until its size reaches around 1000, then append it to the original dataframe (pd.concat in modern pandas) and empty the temp dataframe. This can boost your performance by around 10 times; a sketch follows.
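A minimal sketch of that hybrid strategy (chunk size, column names, and rows_source are illustrative assumptions):
import pandas as pd

cols = ['A', 'B', 'C']                        # illustrative column names
final_df = pd.DataFrame(columns=cols)
temp_df = pd.DataFrame(columns=cols)
for row in rows_source:                       # rows_source: hypothetical iterable of row lists
    temp_df.loc[len(temp_df)] = row           # fast while temp_df stays small
    if len(temp_df) >= 1000:
        final_df = pd.concat([final_df, temp_df], ignore_index=True)
        temp_df = pd.DataFrame(columns=cols)  # empty the temp frame
if len(temp_df):                              # flush the remainder
    final_df = pd.concat([final_df, temp_df], ignore_index=True)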
Simply:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros([rows, columns]))  # rows and columns are your desired dimensions
Then fill it.
Assume a dataframe with 19 rows
index = range(0, 19)
columns = ['A']
test = pd.DataFrame(index=index, columns=columns)
Keeping Column A as a constant
test['A']=10
Keeping column b as a variable given by a loop
for x in range(0, 19):
    test.loc[[x], 'b'] = pd.Series([x], index=[x])
You can replace the first x in pd.Series([x], index = [x]) with any value
This is my way to make a dynamic dataframe from several lists, using a loop:
x = [1,2,3,4,5,6,7,8]
y = [22,12,34,22,65,24,12,11]
z = ['as','ss','wa', 'ss','er','fd','ga','mf']
names = ['Bob', 'Liz', 'chop']
The loop:
def dataF(x, y, z, names):
    res = []
    for t in zip(x, y, z):
        res.append(t)
    return pd.DataFrame(res, columns=names)
Result:
dataF(x,y,z,names)
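With the lists above, this returns:
   Bob  Liz chop
0    1   22   as
1    2   12   ss
2    3   34   wa
3    4   22   ss
4    5   65   er
5    6   24   fd
6    7   12   ga
7    8   11   mf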
# import pandas library
import pandas as pd
# create a dataframe
my_df = pd.DataFrame({"A": ["shirt"], "B": [1200]})
# show the dataframe
print(my_df)