I would like to print the first and last 5 rows of my one-hot encoded data. The code is below. When it prints, the first and last 30 rows are printed instead.
Code:
from random import randint
import pandas_datareader.data as web
import pandas as pd
import datetime
import itertools as it
import numpy as np
import csv
df = pd.read_csv(r'C:\Users\GrahamFam\Desktop\Data Archive\Daily3mid(Archive).txt')
df.columns = ['Date','b1','b2','b3']
df = df.set_index('Date')
reversed_df = df.iloc[::-1]
n=5
#print(reversed_df.drop(df.index[n:-n]))
df = pd.read_csv(r'C:\Users\GrahamFam\Desktop\Data Archive\Daily3eve(Archive).txt')
df.columns = ['Date','b1','b2','b3']
df = df.set_index('Date')
reversed_df = df.iloc[::-1]
n=5
print(reversed_df.drop(df.index[n:-n]), "\n")
BallOne = pd.get_dummies(reversed_df.b1)
BallTwo = pd.get_dummies(reversed_df.b2)
BallThree = pd.get_dummies(reversed_df.b3)
print(BallOne)
print(BallTwo)
print(BallThree)
You can use the head and tail functions:
>>> DataFrame.head(n)
>>> DataFrame.tail(n)
where n is the number of rows you want.
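A minimal sketch putting the two together (the toy DataFrame below stands in for the one-hot encoded data; pd.concat stitches the head and tail into one view):

```python
import pandas as pd

# Toy frame standing in for the one-hot encoded data
df = pd.DataFrame({"b1": range(30)})

n = 5
# First n rows followed by last n rows, as a single DataFrame
first_and_last = pd.concat([df.head(n), df.tail(n)])
print(first_and_last)
```

This prints rows 0-4 followed by rows 25-29 of the toy frame.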
If you're fine displaying the tail before the head, then you can use np.r_ to slice from negative to positive:
import pandas as pd
import numpy as np
df = pd.DataFrame(list(range(30)))
df.iloc[np.r_[-3:3]]
# 0
#27 27
#28 28
#29 29
#0 0
#1 1
#2 2
Otherwise slice explicitly:
n = 3
l = len(df)
df.iloc[np.r_[0:n, l-n:l]]
# 0
#0 0
#1 1
#2 2
#27 27
#28 28
#29 29
I have a 3 dimensional numpy array
([[[0.30706802]],
[[0.19451728]],
[[0.19380492]],
[[0.23329106]],
[[0.23849282]],
[[0.27154338]],
[[0.2616704 ]], ... ])
with shape (844, 1, 1), resulting from an RNN's model.predict():
y_prob = loaded_model.predict(X)
My problem is how to convert it to a pandas dataframe.
I have used Keras
My objective is to have this:
0 0.30706802
7 0.19451728
21 0.19380492
35 0.23329106
42 ...
...
815 ...
822 ...
829 ...
836 ...
843 ...
Name: feature, Length: 78, dtype: float32
The idea is to first flatten the nested list into a flat list, then convert it to a DataFrame using the from_records method:
import numpy as np
import pandas as pd
data = np.array([[[0.30706802]],[[0.19451728]],[[0.19380492]],[[0.23329106]],[[0.23849282]],[[0.27154338]],[[0.2616704 ]]])
import itertools
data = list(itertools.chain(*data))
df = pd.DataFrame.from_records(data)
Without itertools
data = [i for j in data for i in j]
df = pd.DataFrame.from_records(data)
Or you can use the flatten() method, as mentioned in one of the answers, and pass the result directly:
pd.DataFrame(data.flatten(),columns = ['col1'])
Here you go!
import pandas as pd
y = ([[[[11]],[[13]],[[14]],[[15]]]])
a = []
for i in y[0]:
    a.append(i[0])
df = pd.DataFrame(a)
print(df)
Output:
0
0 11
1 13
2 14
3 15
Feel free to set your custom index values both for axis=0 and axis=1.
You could try:
s = pd.Series(your_array.flatten(), name='feature')
https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flatten.html
You can then convert the series to a dataframe using s.to_frame()
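Putting flatten() and to_frame() together on a toy array (the array and column name below are illustrative, not the asker's actual data):

```python
import numpy as np
import pandas as pd

# Toy array standing in for the (844, 1, 1) model output
arr = np.array([[[0.1]], [[0.2]], [[0.3]]], dtype="float32")

s = pd.Series(arr.flatten(), name="feature")  # shape (3,) after flattening
df = s.to_frame()                             # single-column DataFrame named 'feature'
print(df)
```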
Overview
How do you populate a pandas DataFrame using math that uses the column and row indices as variables?
Setup
import pandas as pd
import numpy as np
df = pd.DataFrame(index = range(5), columns = ['Combo_Class0', 'Combo_Class1', 'Combo_Class2', 'Combo_Class3', 'Combo_Class4'])
Objective
Each cell in df = row index * (column index + 2)
Attempt 1
You can use this solution to produce the following code:
row = 0
for i in range(5):
    row = row + 1
    df.loc[i] = [(row)*(1+2), (row)*(2+2), (row)*(3+2), (row)*(4+2), (row)*(5+2)]
Attempt 2
This solution seemed relevant as well, although I believe I've read you're not supposed to loop through dataframes. Besides, I'm not seeing how to loop through rows and columns:
for i, j in df.iterrows():
    df.loc[i] = i
You can leverage broadcasting for a more efficient approach:
ix = (df.index+1).to_numpy()  # use .values for pandas < 0.24
df[:] = ix[:,None] * (ix+2)
print(df)
Combo_Class0 Combo_Class1 Combo_Class2 Combo_Class3 Combo_Class4
0 3 4 5 6 7
1 6 8 10 12 14
2 9 12 15 18 21
3 12 16 20 24 28
4 15 20 25 30 35
Using multiply outer
df[:]=np.multiply.outer((np.arange(5)+1),(np.arange(5)+3))
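Both routes produce the same array; a quick toy check of the two one-liners above:

```python
import numpy as np

ix = np.arange(1, 6)                 # 1-based row/column indices
a = ix[:, None] * (ix + 2)           # broadcasting version
b = np.multiply.outer(np.arange(5) + 1, np.arange(5) + 3)  # ufunc outer version
```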
This is an "ISO week 53 problem".
I have a pandas Series instance with index values representing the ISO week number:
import pandas as pd
ts = pd.Series([1,1,1,2,3,1,2], index=[1,1,2,2,52,53,53])
I want to randomly and equally replace all of the index = 53 indices with either index = 52 or index = 1.
For the above, this could be:
import pandas as pd
ts = pd.Series([1,1,1,2,3,1,2], index=[1,1,2,2,52,52,1])
or
import pandas as pd
ts = pd.Series([1,1,1,2,3,1,2], index=[1,1,2,2,52,1,52])
for example. How do I do this, please?
Thanks for any help.
EDIT
In numpy I used the following to achieve this:
from numpy import where
from numpy.random import shuffle
indices = where(timestamps == 53)[0]
number_of_indices = len(indices)
if number_of_indices == 0:
    return  # no ISO week number 53 to fix
shuffle(indices) # randomly shuffle the indices.
midway_index = number_of_indices // 2
timestamps[indices[midway_index:]] = 52 # precedence if only 1 timestamp.
timestamps[indices[: midway_index]] = 1
where the timestamps array is the pandas index value.
List comprehension should work if I understand you correctly:
import numpy as np
ts = pd.Series([1,1,1,2,3,1,2], index=[1,1,2,2,52,53,53])
ts.index = [i if i != 53 else np.random.choice([1, 52]) for i in ts.index]
1 1
1 1
2 1
2 2
52 3
52 1
1 2
dtype: int64
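Note that np.random.choice picks 52 or 1 independently for each element, so the split is random but not guaranteed to be exactly even. If the split must be even, as in the numpy version in the question, one sketch is to shuffle the positions of the 53s and relabel each half explicitly:

```python
import numpy as np
import pandas as pd

ts = pd.Series([1, 1, 1, 2, 3, 1, 2], index=[1, 1, 2, 2, 52, 53, 53])

idx = ts.index.to_numpy().copy()
pos = np.where(idx == 53)[0]   # positions of the week-53 labels
np.random.shuffle(pos)         # random order
mid = len(pos) // 2
idx[pos[:mid]] = 1             # first half becomes week 1
idx[pos[mid:]] = 52            # second half (52 takes precedence for an odd count)
ts.index = idx
```

The values are untouched; only the index labels change, with exactly half of the 53s (rounded in favor of 52) going each way.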
I want to shuffle a pandas dataframe 'n' times, save each shuffled dataframe with a new name, and then export it to a 'csv' file. What I mean is:
import pandas as pd
import sklearn
import numpy as np
from sklearn.utils import shuffle
df = pd.read_csv('example.csv')
Then something like this:
for i in np.arange(n):
    df_%i = shuffle(df)
    df_%i.to_csv('example.csv')
I appreciate any help. Thanks!
You can use:
for i in range(n):
    df.sample(frac=1).to_csv(f"example_{i}.csv")
If you need to create an arbitrary number of variables, you should store them in a dictionary and you can reference them later by their keys; in this case the integer you loop over.
d = {}
for i in range(n):
    d[i] = df.sample(frac=1)  # d[i] = shuffle(df) in your case
    d[i].to_csv(f'example_{i}.csv')
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1, 10, (3, 3)))
d = {}
for i in range(5):
    d[i] = df.sample(frac=1)
d[1]
# 0 1 2
#0 6 3 2
#1 7 6 4
#2 2 6 9
d[2]
# 0 1 2
#2 2 6 9
#1 7 6 4
#0 6 3 2
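One detail worth noting: sample(frac=1) draws a fresh permutation on every call, so if the shuffles need to be reproducible you can pass a seed via random_state (the seed value below is arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3))

# The same seed yields the same permutation of rows
shuffled_a = df.sample(frac=1, random_state=7)
shuffled_b = df.sample(frac=1, random_state=7)
```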
Let's say I'm in the following situation:
import pandas as pd
import dask.dataframe as dd
import random
s = "abcd"
lst = 10*[0]+list(range(1,6))
n = int(1e2)
df = pd.DataFrame({"col1": [random.choice(s) for i in range(n)],
"col2": [random.choice(lst) for i in range(n)]})
df["idx"] = df.col1
df = df[["idx","col1","col2"]]
def fun(data):
    if data["col2"].mean() > 1:
        return 2
    else:
        return 1
df.set_index("idx", inplace=True)
ddf1 = dd.from_pandas(df, npartitions=4)
gpb = ddf1.groupby("col1").apply(fun, meta=pd.Series(name='col3'))
ddf2 = ddf1.join(gpb.to_frame(), on="col1")
While ddf1.known_divisions is True, ddf2.known_divisions is False. I would like to preserve the same divisions on the ddf2 dataframe.
In one random run I even got an empty partition:
for i in range(ddf1.npartitions):
    print(i, len(ddf1.get_partition(i)), len(ddf2.get_partition(i)))
0 27 50
1 29 0
2 23 21
3 21 29