How to store 2D matrix values in a variable? - python

import numpy as np
import pandas as df
from numpy import asarray
from numpy import save
files=np.load('arr.npy',allow_pickle=True)
#print(files)
data=df.DataFrame(files)
type(data)
rr=data.shape[0]
for i in range(0,rr):
    res=data[0][i]
After running this, res contains only the last element, but I want all the values. How can I store all the 2D matrix values in Python?
data is a DataFrame with 9339 rows and 2 columns. I only want the first column, where each entry is a 32x32 matrix.
How do I store all of those values in the res variable?

Notice that res = data[0][i] initializes a new variable res on the first iteration of the loop (when i is 0), but then keeps reassigning its value to the value in the next row (staying on column 0).
I'm not sure exactly what you want, but it sounds like you just want the first column in a separate variable. Here is how to get the first column, as a pandas Series and/or a plain list, with a smaller example (9 rows and 2 columns):
import numpy as np
import pandas as pd
random_data = np.random.rand(9,2)
data_df = pd.DataFrame(random_data)
print(data_df)
# this gets the first column as a pandas series. Change index from 0 to get another column.
print('\nfirst column:')
first_col = data_df[data_df.columns[0]]
print(first_col)
# if you want a plain list instead of a series
print('\nfirst column as list:')
print(first_col.tolist())
Output:
0 1
0 0.218237 0.323922
1 0.806697 0.371456
2 0.526571 0.993491
3 0.403947 0.299652
4 0.753333 0.542269
5 0.365885 0.534462
6 0.404583 0.514687
7 0.298897 0.637910
8 0.453891 0.234333
first column:
0 0.218237
1 0.806697
2 0.526571
3 0.403947
4 0.753333
5 0.365885
6 0.404583
7 0.298897
8 0.453891
Name: 0, dtype: float64
first column as list:
[0.21823726509923325, 0.8066974875381492, 0.526571422644495, 0.40394686954663594, 0.7533330239460391, 0.36588470364914194, 0.4045827678891364, 0.2988970490642284, 0.45389073978613426]
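If each cell of the first column holds a 32x32 array, as the question describes, the same idea applies: take the whole column at once instead of overwriting one variable in a loop. The following is a minimal sketch, assuming a small stand-in for arr.npy (5 rows instead of 9339, and a second column that is assumed to be a label purely for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in for the data loaded from arr.npy: each row is (32x32 matrix, label).
# The label column is an assumption made for illustration.
rows = [(np.full((32, 32), i, dtype=float), i) for i in range(5)]
files = np.empty((5, 2), dtype=object)
for r, (matrix, label) in enumerate(rows):
    files[r, 0] = matrix
    files[r, 1] = label

data = pd.DataFrame(files)

# Keep every matrix from column 0 instead of overwriting one variable.
res = data[0].tolist()      # list of 5 arrays, each 32x32
stacked = np.stack(res)     # single array of shape (5, 32, 32)

print(len(res), stacked.shape)
```

Whether the list or the stacked array is more convenient depends on what happens next; the stacked form only works when every matrix has the same shape.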

Related

How to randomly select a single value in Pandas data frame

I was searching forever to find an answer to this and no one seemed to have asked, so I am asking and providing my answer here.
How do you randomly select a single cell value in pandas data frame?
It is really simple. I was struggling to find the syntax, and I was also making the mistake of not starting my count from 0.
from random import randint
random_first_message = data_frame.loc[randint(0,7), 'First_Message']
This code selects a random row label between 0 and 7 (random.randint includes both endpoints), then selects the column First_Message and assigns that value to the variable random_first_message.
Quick example:
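# generating a random string df
# rands_array is a private pandas testing helper; it may be unavailable in newer pandas versions
import pandas as pd
from pandas._testing import rands_array
randstr = rands_array(10, 10)
df = pd.DataFrame(data=randstr, columns=["First_Message"])
df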
Result:
  First_Message
0    cl3pf7uYI7
1    C6ZmXvu8Fy
2    QcK6sHfRHS
3    p59ZNSA9Bs
4    ctNlg0X23n
5    vGqXIyF95L
6    Bwr9ECqhst
7    Wam6VmgLbu
8    DniZeLQXNx
9    LH3QMGRrG6
Then simply choose the column and the number of samples like this:
df['First_Message'].sample(3)
3 p59ZNSA9Bs
8 DniZeLQXNx
6 Bwr9ECqhst
Name: First_Message, dtype: object
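Both approaches above can be written so that they do not depend on a hard-coded row count. This is a small sketch with made-up strings in place of the random ones; the variable names value_a and value_b are illustrative only:

```python
import random
import pandas as pd

df = pd.DataFrame({"First_Message": ["alpha", "bravo", "charlie", "delta"]})

# Option 1: pick a random row position, then read the cell.
# random.randint includes both endpoints, so the upper bound is len(df) - 1.
row = random.randint(0, len(df) - 1)
value_a = df["First_Message"].iloc[row]

# Option 2: let pandas do the sampling; .iloc[0] unwraps the one-element Series.
value_b = df["First_Message"].sample(1).iloc[0]

print(value_a, value_b)
```

Using .iloc with a position avoids surprises when the index labels are not 0..n-1, which .loc with a random integer would not handle.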

Getting the columns of a pandas series

I have a pandas.core.series as such:
140228202800 25
130422174258 5
131213194708 3
130726171426 1
I would like to get the first column and second column separately
Column 1:
140228202800
130422174258
131213194708
130726171426
Column 2:
25
5
3
1
I tried the following but no luck.
my_series.iloc[:,0]
my_series.loc[:,0]
my_series[:,0]
The first "column" is the index; you can get it with s.index, or with s.index.to_list() to obtain it as a list.
To get the series values as a list use s.to_list(), and to get them as a NumPy array use s.values.
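As a quick illustration of the point above, here is a sketch that rebuilds the Series from the question (assuming the left-hand numbers are the index) and pulls the two "columns" apart. The column names id and count are made up for the example:

```python
import pandas as pd

# Stand-in for the Series in the question: the left-hand "column" is the index.
s = pd.Series([25, 5, 3, 1],
              index=[140228202800, 130422174258, 131213194708, 130726171426])

first = s.index.to_list()    # the index labels
second = s.to_list()         # the stored values
as_array = s.to_numpy()      # the values as a NumPy array

# reset_index turns the index into a regular column if a DataFrame is preferred.
frame = s.reset_index()
frame.columns = ["id", "count"]   # column names are illustrative
print(first, second)
print(frame)
```

The attempts with [:, 0] fail because a Series is one-dimensional, so there is no second axis to slice.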

Creating New Column in Pandas Data frame with For Loop Issue

I am trying to compare two columns (key.response and corr_answer) in a csv file using pandas and creating a new column "Correct_or_Not" that will contain a 1 in the cell if the key.response and corr_answer column are equal and a 0 if they are not. When I evaluate on their own outside of the loop they return the truth value I expect. The first part of the code is just me formatting the data to remove some brackets and apostrophes.
I tried using a for loop, but for some reason it puts a 0 in every row of "Correct_or_Not".
import pandas as pd
df= pd.read_csv('exptest.csv')
df['key.response'] = df['key.response'].str.replace(']','')
df['key.response'] = df['key.response'].str.replace('[','')
df['key.response'] = df['key.response'].str.replace("'",'')
df['corr_answer'] = df['corr_answer'].str.replace(']','')
df['corr_answer'] = df['corr_answer'].str.replace('[','')
df['corr_answer'] = df['corr_answer'].str.replace("'",'')
for i in range(df.shape[0]):
    if df['key.response'][i] == df['corr_answer'][i]:
        df['Correct_or_Not']=1
    else:
        df['Correct_or_Not']=0
df.head()
key.response corr_answer Correct_or_Not
0 1 1 0
1 2 2 0
2 1 2 0
You can generate the Correct_or_Not column all at once without the loop:
df['Correct_or_Not'] = df['key.response'] == df['corr_answer']
and df['Correct_or_Not'] = df['Correct_or_Not'].astype(int) if you need the results as integers.
In your loop you forgot the index [i] when assigning the result, so each iteration overwrites the whole column and the last row's result gets applied everywhere.
You can also do this:
df['Correct_or_Not'] = 0
for i in range(df.shape[0]):
    if df['key.response'][i] == df['corr_answer'][i]:
        # .loc avoids chained assignment, which may not modify df
        df.loc[i, 'Correct_or_Not'] = 1
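For reference, the loop-free comparison can be tried on a tiny frame. This sketch uses a hand-written stand-in for exptest.csv (three rows matching the output shown in the question, after the bracket and quote cleanup):

```python
import pandas as pd

# Small stand-in for exptest.csv after the bracket/quote cleanup.
df = pd.DataFrame({"key.response": ["1", "2", "1"],
                   "corr_answer":  ["1", "2", "2"]})

# Compare the two columns element-wise and store 1/0 instead of True/False.
df["Correct_or_Not"] = (df["key.response"] == df["corr_answer"]).astype(int)
print(df)
```

The comparison runs over the whole column at once, so there is no per-row assignment to get wrong.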

Append a row to a dataframe

Fairly new to pandas and I have created a data frame called rollParametersDf:
rollParametersDf = pd.DataFrame(columns=['insampleStart','insampleEnd','outsampleStart','outsampleEnd'], index=[])
with the 4 column headings given, which I would like to hold the reference dates for a study I am running. I want to add rows of data (one at a time) with the index names roll1, roll2, ..., rolln, created using the following code:
outsampleEnd = customCalender.iloc[[totalDaysAvailable]]
outsampleStart = customCalender.iloc[[totalDaysAvailable-outsampleLength+1]]
insampleEnd = customCalender.iloc[[totalDaysAvailable-outsampleLength]]
insampleStart = customCalender.iloc[[totalDaysAvailable-outsampleLength-insampleLength+1]]
print('roll',rollCount,'\t',outsampleEnd,'\t',outsampleStart,'\t',insampleEnd,'\t',insampleStart,'\t')
rollParametersDf.append({insampleStart,insampleEnd,outsampleStart,outsampleEnd})
I have tried using append but cannot get an individual row to append.
I would like the final dataframe to look like:
insampleStart insampleEnd outsampleStart outsampleEnd
roll1 1 5 6 8
roll2 2 6 7 9
:
rolln
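You give append key-value pairs (a dict mapping each column name to a single value), one row per call:
df = pd.DataFrame({'insampleStart':[], 'insampleEnd':[], 'outsampleStart':[], 'outsampleEnd':[]})
# note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on newer versions use pd.concat
df = df.append({'insampleStart':1, 'insampleEnd':5, 'outsampleStart':6, 'outsampleEnd':8}, ignore_index=True)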
The pandas documentation has an example of appending rows to a DataFrame. This appending action is different from that of a list in that this appending action generates a new DataFrame. This means that for each append action you are rebuilding and reindexing the DataFrame which is pretty inefficient. Here is an example solution:
# create empty dataframe
columns=['insampleStart','insampleEnd','outsampleStart','outsampleEnd']
rollParametersDf = pd.DataFrame(columns=columns)
# loop through 5 rows and append them to the dataframe
for i in range(5):
    # create some artificial data
    data = np.random.normal(size=(1, len(columns)))
    # append creates a new dataframe which makes this operation inefficient
    # ignore_index causes reindexing on each call.
    # note: DataFrame.append was removed in pandas 2.0, so this needs an older version
    rollParametersDf = rollParametersDf.append(pd.DataFrame(data, columns=columns),
                                               ignore_index=True)
print(rollParametersDf)
insampleStart insampleEnd outsampleStart outsampleEnd
0 2.297031 1.792745 0.436704 0.706682
1 0.984812 -0.417183 -1.828572 -0.034844
2 0.239083 -1.305873 0.092712 0.695459
3 -0.511505 -0.835284 -0.823365 -0.182080
4 0.609052 -1.916952 -0.907588 0.898772
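Since every append rebuilds the frame, one common alternative is to collect the rows in a plain Python container and build the DataFrame once at the end. This is a sketch of that idea, not the approach from either answer above; the numeric values are placeholders for the dates that would come from customCalender in the question:

```python
import pandas as pd

columns = ["insampleStart", "insampleEnd", "outsampleStart", "outsampleEnd"]

rows = {}
for roll_count in range(1, 4):
    # Illustrative values only; in the question they come from customCalender.
    rows[f"roll{roll_count}"] = [roll_count, roll_count + 4,
                                 roll_count + 5, roll_count + 7]

# Build the frame once, using the dict keys as the row index.
rollParametersDf = pd.DataFrame.from_dict(rows, orient="index", columns=columns)
print(rollParametersDf)
```

This also gives the roll1, roll2, ... index labels directly, and it does not rely on DataFrame.append.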

Creating a New Pandas Grouped Object

In some transformations, I seem to be forced to break from the Pandas dataframe grouped object, and I would like a way to return to that object.
Given a dataframe of time series data, if one groups by one of the values in the dataframe, we are given an underlying dictionary from key to dataframe.
Being forced to make a Python dict from this, the structure cannot be converted back into a Dataframe using the .from_dict() because the structure is key to dataframe.
The only way to go back to Pandas without some hacky column renaming is, to my knowledge, by converting it back to a grouped object.
Is there any way to do this?
If not, how would I convert a dictionary mapping each key to a dataframe back into a Pandas data structure?
EDIT, adding a sample:
import pandas as pd
from numpy.random import randn
rng = pd.date_range('1/1/2000', periods=10, freq='10m')
df = pd.DataFrame({'a':pd.Series(randn(len(rng)), index=rng), 'b':pd.Series(randn(len(rng)), index=rng)})
# now have a dataframe with 'a's and 'b's in a time series
df_dict = {}
for k, v in df.groupby('a'):
    df_dict[k] = v
# now we apply some transformation that cannot be applied via aggregate, transform, or apply
# how do we get this back into a grouped object?
If I understand OP's question correctly, you want to group a dataframe by some key(s), do different operations on each group (possibly generating new columns, etc.) and then go back to the original dataframe.
Modifying your example (grouping by random integers instead of floats, which are usually unique):
np.random.seed(200)
rng = pd.date_range('1/1/2000', periods=10, freq='10m')
df = pd.DataFrame({'a':pd.Series(np.random.randn(len(rng)), index=rng), 'b':pd.Series(np.random.randn(len(rng)), index=rng)})
df['group'] = np.random.randint(3,size=(len(df)))
Usually, if I need a single value for each column per group, I'll do this (for example, sum of 'a', mean of 'b'):
In [10]: df.groupby('group').aggregate({'a':np.sum, 'b':np.mean})
Out[10]:
a b
group
0 -0.214635 -0.319007
1 0.711879 0.213481
2 1.111395 1.042313
[3 rows x 2 columns]
However, if I need a series for each group:
In [19]: def func(sub_df):
   ....:     sub_df['c'] = sub_df['a'] * sub_df['b'].shift(1)
   ....:     return sub_df
   ....:
In [20]: df.groupby('group').apply(func)
Out[20]:
a b group c
2000-01-31 -1.450948 0.073249 0 NaN
2000-11-30 1.910953 1.303286 2 NaN
2001-09-30 0.711879 0.213481 1 NaN
2002-07-31 -0.247738 1.017349 2 -0.322874
2003-05-31 0.361466 1.911712 2 0.367737
2004-03-31 -0.032950 -0.529672 0 -0.002414
2005-01-31 -0.221347 1.842135 2 -0.423151
2005-11-30 0.477257 -1.057235 0 -0.252789
2006-09-30 -0.691939 -0.862916 2 -1.274646
2007-07-31 0.792006 0.237631 0 -0.837336
[10 rows x 4 columns]
I'm guessing you want something like the second example, but the original question wasn't very clear, even with your example.
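If the data really has ended up in a plain dict of key to DataFrame, one way back into pandas is pd.concat, which accepts a dict and uses its keys as the outer level of a MultiIndex. The sketch below assumes a toy frame and a made-up per-group transformation; the level names group_key and row are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(6, dtype=float),
                   "group": [0, 1, 0, 2, 1, 2]})

# Split into a plain dict of group key -> sub-DataFrame, as in the question.
df_dict = {k: v.copy() for k, v in df.groupby("group")}

# Arbitrary per-group transformation done outside pandas' groupby machinery.
for k, sub_df in df_dict.items():
    sub_df["a_scaled"] = sub_df["a"] * (k + 1)

# pd.concat accepts a dict; the keys become the outer level of a MultiIndex.
combined = pd.concat(df_dict, names=["group_key", "row"])

# Grouping on that level gives a grouped object again.
regrouped = combined.groupby(level="group_key")
print(regrouped["a_scaled"].sum())
```

The combined frame keeps the original row labels in the inner index level, so the rows can still be matched back to the source data.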
