I want to shuffle a pandas dataframe 'n' times and save the shuffled dataframe with a new name and then export it to a 'csv' file. What I mean is-
import pandas as pd
import sklearn
import numpy as np
from sklearn.utils import shuffle
df = pd.read_csv('example.csv')
Then something like this-
for i in np.arange(n):
df_%i = shuffle(df)
df_%i.to_csv('example.csv')
I appreciate any help. Thanks!
You can use
for i in range(n):
df.sample(frac= 1).to_csv(f"example_{i}.csv")
If you need to create an arbitrary number of variables, you should store them in a dictionary and you can reference them later by their keys; in this case the integer you loop over.
d = {}
for i in range(n):
d[i] = df.sample(frac=1) #d[i] = shuffle(df) in your case
d[i].to_csv(f'example_{i}.csv')
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1, 10, (3, 3)))
d = {}
for i in range(5):
d[i] = df.sample(frac=1)
d[1]
# 0 1 2
#0 6 3 2
#1 7 6 4
#2 2 6 9
d[2]
# 0 1 2
#2 2 6 9
#1 7 6 4
#0 6 3 2
Related
I have scraped a webpage table, and the table items are in a sequential 1D list, with repeated headers. I want to reconstitute the table into a DataFrame.
I have an algorithm to do this, but I'd like to know if there is a more pythonic/efficient way to achieve this? NB. I don't necessarily know how many columns there are in my table. Here's an example:
input = ['A',1,'B',5,'C',9,
'A',2,'B',6,'C',10,
'A',3,'B',7,'C',11,
'A',4,'B',8,'C',12]
output = {}
it = iter(input)
val = next(it)
while val:
if val in output:
output[val].append(next(it))
else:
output[val] = [next(it)]
val = next(it,None)
df = pd.DataFrame(output)
print(df)
with the result:
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
If your data is always "well behaved", then something like this should suffice:
import pandas as pd
data = ['A',1,'B',5,'C',9,
'A',2,'B',6,'C',10,
'A',3,'B',7,'C',11,
'A',4,'B',8,'C',12]
result = {}
for k,v in zip(data[::2], data[1::2]):
result.setdefault(k, []).append(v)
df = pd.DataFrame(output)
You can also use numpy reshape:
import numpy as np
cols = sorted(set(l[::2]))
df = pd.DataFrame(np.reshape(l, (int(len(l)/len(cols)/2), len(cols)*2)).T[1::2].T, columns=cols)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Explaination:
# get columns
cols = sorted(set(l[::2]))
# reshape list into list of lists
shape = (int(len(l)/len(cols)/2), len(cols)*2)
np.reshape(l, shape)
# get only the values of the data
.T[1::2].T
# this transposes the data and slices every second step
Suppose I have a n x k matrix X. And I want to get the sum across the columns, but for every permutation of the rows. So if my matrix is [[1,2],[3,4]] my desired output would be [1+2, 1+4, 3+2, 3+4]. I produce a MWE example with my first attempt at a solution. I'm hoping I can get some help to reduce the computation time.
My actual problem has n=160 and k=4, and it takes quite a while to run (as of writing this, it's still running).
import pandas as pd
import numpy as np
import itertools
n = 4
k = 3
X = np.random.randint(0, 10, (n, k))
df = pd.DataFrame(X)
df
0 1 2
0 2 9 2
1 7 6 4
2 3 7 0
3 5 0 0
ixi = df.index.tolist()
ixc = df.columns.tolist()
psum = np.array([df.lookup(i, ixc).sum() for i in
itertools.product(ixi, repeat=len(ixc))])
You can try functools.reduce:
from functools import reduce
reduce(np.add.outer, df.values.T).ravel()
Overview
How do you populate a pandas dataframe using math which uses column and row indices as variables.
Setup
import pandas as pd
import numpy as np
df = pd.DataFrame(index = range(5), columns = ['Combo_Class0', 'Combo_Class1', 'Combo_Class2', 'Combo_Class3', 'Combo_Class4'])
Objective
Each cell in df = row index * (column index + 2)
Attempt 1
You can use this solution to produce the following code:
row = 0
for i in range(5):
row = row + 1
df.loc[i] = [(row)*(1+2), (row)*(2+2), (row)*(3+2), (row)*(4+2), (row)*(4+2), (row)*(5+2)]
Attempt 2
This solution seemed relevant as well, although I believe I've read you're not supposed to loop through dataframes. Besides, I'm not seeing how to loop through rows and columns:
for i, j in df.iterrows():
df.loc[i] = i
You can leverage broadcasting for a more efficient approach:
ix = (df.index+1).to_numpy() # .values for pandas 0.24<
df[:] = ix[:,None] * (ix+2)
print(df)
Combo_Class0 Combo_Class1 Combo_Class2 Combo_Class3 Combo_Class4
0 3 4 5 6 7
1 6 8 10 12 14
2 9 12 15 18 21
3 12 16 20 24 28
4 15 20 25 30 35
Using multiply outer
df[:]=np.multiply.outer((np.arange(5)+1),(np.arange(5)+3))
I have a Pandas DataFrame with the following structure:
In [1]: df
Out[1]:
location_code month amount
0 1 1 10
1 1 2 11
2 1 3 12
3 1 4 13
4 1 5 14
5 1 6 15
6 2 1 23
7 2 2 25
8 2 3 27
9 2 4 29
10 2 5 31
11 2 6 33
I also have a DataFrame with the following:
In [2]: output_df
Out[2]:
location_code regression_coef
0 1 None
1 2 None
What I would like:
output_df = df.groupby('location_code')[amount].apply(linear_regression_and_return_coefficient)
I would like to group by the location code and then perform a linear regression on the values of amount and store the coefficient. I have tried the following code:
import pandas as pd
import statsmodels.api as sm
import numpy as np
gb = df.groupby('location_code')['amount']
x = []
for j in range(6): x.append(j+1)
for location_code, amount in gb:
trans = amount.tolist()
x = sm.add_constant(x)
model = sm.OLS(trans, x)
results = model.fit()
output_df['regression_coef'][merchant_location_code] = results.params[1]/np.mean(trans)
This code works, but my data set is somewhat large (about 5 gb) and a bit more complex, and this is taking a REALLY LONG TIME. I am wondering if there is a vectorized operation that can do this more efficiently? I know that using loops on a Pandas DataFrame is bad.
SOLUTION
After some tinkering around, I wrote a function that can be used with the apply method on a groupby.
def get_lin_reg_coef(series):
x=sm.add_constant(range(1,7))
result = sm.OLS(series, x).fit().params[1]
return result/series.mean()
gb = df.groupby('location_code')['amount']
output_df['lin_reg_coef'] = gb.apply(get_lin_reg_coef)
Benchmarking this versus the iterative solution I had before, with varying input sizes gets:
DataFrame Rows Iterative Solution (sec) Vectorized Solution (sec)
370,000 81.42 84.46
1,850,000 448.36 365.66
3,700,000 1282.83 715.89
7,400,000 5034.62 1407.88
Clearly a lot faster as the dataset grows in size!
Without knowing more about the data, number of records, etc, this code should run faster:
import pandas as pd
import statsmodels.api as sm
import numpy as np
gb = df.groupby('location_code')['amount']
x = sm.add_constant(range(1,7))
def fit(stuff):
return sm.OLS(stuff["amount"], x).fit().params[1] / stuff["amount"].mean()
output = gb.apply(fit)
I am iterating through the rows of a pandas DataFrame, expanding each one out into N rows with additional info on each one (for simplicity I've made it a random number here):
from pandas import DataFrame
import pandas as pd
from numpy import random, arange
N=3
x = DataFrame.from_dict({'farm' : ['A','B','A','B'],
'fruit':['apple','apple','pear','pear']})
out = DataFrame()
for i,row in x.iterrows():
rows = pd.concat([row]*N).reset_index(drop=True) # requires row to be a DataFrame
out = out.append(rows.join(DataFrame({'iter': arange(N), 'value': random.uniform(size=N)})))
In this loop, row is a Series object, so the call to pd.concat doesn't work. How do I convert it to a DataFrame? (Eg. the difference between x.ix[0:0] and x.ix[0])
Thanks!
Given what you commented, I would try
def giveMeSomeRows(group):
return random.uniform(low=group.low, high=group.high, size=N)
results = x.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
This should give you a separate result dataframe. I have assumed that every farm-fruit combination is unique... there might be other ways, if we'd know more about your data.
Update
Running code example
def giveMeSomeRows(group):
return random.uniform(low=group.low, high=group.high, size=N)
N = 3
df = pd.DataFrame(arange(0,8).reshape(4,2), columns=['low', 'high'])
df['farm'] = 'a'
df['fruit'] = arange(0,4)
results = df.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
df
low high farm fruit
0 0 1 a 0
1 2 3 a 1
2 4 5 a 2
3 6 7 a 3
results
farm fruit
a 0 [0.176124290969, 0.459726835079, 0.999564934689]
1 [2.42920143009, 2.37484506501, 2.41474002256]
2 [4.78918572452, 4.25916442343, 4.77440617104]
3 [6.53831891152, 6.23242754976, 6.75141668088]
If instead you want a dataframe, you can update the function to
def giveMeSomeRows(group):
return pandas.DataFrame(random.uniform(low=group.low, high=group.high, size=N))
results
0
farm fruit
a 0 0 0.281088
1 0.020348
2 0.986269
1 0 2.642676
1 2.194996
2 2.650600
2 0 4.545718
1 4.486054
2 4.027336
3 0 6.550892
1 6.363941
2 6.702316