Creating Dataframe with numpy array with index and columns [duplicate] - python

I have a Numpy array consisting of a list of lists, representing a two-dimensional array with row labels and column names as shown below:
data = array([['','Col1','Col2'],['Row1',1,2],['Row2',3,4]])
I'd like the resulting DataFrame to have Row1 and Row2 as index values, and Col1 and Col2 as header values.
I can specify the index as follows:
df = pd.DataFrame(data,index=data[:,0])
However, I am unsure how best to assign column headers.

You need to pass data, index and columns to the DataFrame constructor, as in:
>>> pd.DataFrame(data=data[1:,1:], # values
... index=data[1:,0], # 1st column as index
... columns=data[0,1:]) # 1st row as the column names
Edit: as @joris notes in the comments, you may need to change the above to np.int_(data[1:,1:]) to get the correct data type.
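Because the labeled array mixes text and numbers, NumPy stores every element as a string, so the values usually need converting after slicing. Here is a minimal sketch of the approach above, assuming every numeric cell is an integer; `.astype(int)` is used here in place of `np.int_`.

```python
import numpy as np
import pandas as pd

# Header row and label column are mixed in with the values. NumPy keeps
# such an array as strings, so the numbers are written as text here.
data = np.array([['', 'Col1', 'Col2'],
                 ['Row1', '1', '2'],
                 ['Row2', '3', '4']])

df = pd.DataFrame(data=data[1:, 1:].astype(int),  # values, converted to int
                  index=data[1:, 0],              # first column as index
                  columns=data[0, 1:])            # first row as column names
print(df)
```

Without the conversion the cells stay as strings, so arithmetic on the columns would not behave numerically.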

Here is an easy-to-understand solution:
import numpy as np
import pandas as pd
# Creating a 2 dimensional numpy array
>>> data = np.array([[5.8, 2.8], [6.0, 2.2]])
>>> data
array([[5.8, 2.8],
       [6. , 2.2]])
# Creating pandas dataframe from numpy array
>>> dataset = pd.DataFrame({'Column1': data[:, 0], 'Column2': data[:, 1]})
>>> print(dataset)
   Column1  Column2
0      5.8      2.8
1      6.0      2.2

I agree with Joris; it seems like you should be doing this differently, like with numpy record arrays. Modifying "option 2" from this great answer, you could do it like this:
import pandas
import numpy
dtype = [('Col1','int32'), ('Col2','float32'), ('Col3','float32')]
values = numpy.zeros(20, dtype=dtype)
index = ['Row'+str(i) for i in range(1, len(values)+1)]
df = pandas.DataFrame(values, index=index)

This can be done simply by using the from_records constructor of the pandas DataFrame:
import numpy as np
import pandas as pd
# Creating a numpy array
x = np.arange(1,10,1).reshape(-1,1)
dataframe = pd.DataFrame.from_records(x)
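`from_records` is most useful when the array already carries field names. The sketch below uses a hypothetical structured array (the field names `id` and `value` are made up for illustration) to show the fields becoming columns, with one of them used as the index.

```python
import numpy as np
import pandas as pd

# Hypothetical structured array: each record has an 'id' and a 'value' field
records = np.array([(1, 2.5), (2, 3.5), (3, 4.5)],
                   dtype=[('id', 'i4'), ('value', 'f8')])

# Field names become column names; the 'id' field is used as the index
df = pd.DataFrame.from_records(records, index='id')
print(df)
```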

>>> import pandas as pd
>>> import numpy as np
>>> data.shape
(480, 193)
>>> type(data)
numpy.ndarray
>>> df = pd.DataFrame(data=data[0:, 0:],
...                   index=[i for i in range(data.shape[0])],
...                   columns=['f' + str(i) for i in range(data.shape[1])])
>>> df.head()

Here is a simple example that creates a pandas DataFrame from NumPy arrays.
import numpy as np
import pandas as pd
# create an array
var1 = np.arange(start=1, stop=21, step=1).reshape(-1)
var2 = np.random.rand(20,1).reshape(-1)
print(var1.shape)
print(var2.shape)
dataset = pd.DataFrame()
dataset['col1'] = var1
dataset['col2'] = var2
dataset.head()

Adding to @behzad.nouri's answer, we can create a helper routine to handle this common scenario:
import numpy as np
import pandas as pd
def csvDf(dat, **kwargs):
    # Nothing to build from an empty input
    if dat is None or len(dat) == 0 or len(dat[0]) == 0:
        return None
    data = np.array(dat)
    # First row holds the column names, first column holds the row labels
    return pd.DataFrame(data[1:,1:], index=data[1:,0], columns=data[0,1:], **kwargs)
Let's try it out:
data = [['','a','b','c'],['row1','row1cola','row1colb','row1colc'],
        ['row2','row2cola','row2colb','row2colc'],['row3','row3cola','row3colb','row3colc']]
In [61]: csvDf(data)
Out[61]:
             a         b         c
row1  row1cola  row1colb  row1colc
row2  row2cola  row2colb  row2colc
row3  row3cola  row3colb  row3colc

I think this is a simple and intuitive method:
data = np.array([[0, 0], [0, 1] , [1, 0] , [1, 1]])
reward = np.array([1,0,1,0])
dataset = pd.DataFrame()
dataset['StateAttributes'] = data.tolist()
dataset['reward'] = reward.tolist()
dataset
Note that storing lists in a column has performance implications, detailed in "How to set the value of a pandas column as list".

It's not so short, but maybe it can help you.
Creating the array:
import numpy as np
import pandas as pd
data = np.array([['col1', 'col2'], [4.8, 2.8], [7.0, 1.2]])
>>> data
array([['col1', 'col2'],
       ['4.8', '2.8'],
       ['7.0', '1.2']], dtype='<U4')
Creating the data frame:
df = pd.DataFrame(i for i in data).transpose()
df.drop(0, axis=1, inplace=True)
df.columns = data[0]
>>> df
  col1 col2
0  4.8  7.0
1  2.8  1.2
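Note that the snippet above places each original data row into a column, and the values remain strings. If the intent is for each row of the array to stay a row, a shorter variant is sketched below. It assumes the first row holds the column names and that all remaining cells parse as floats.

```python
import numpy as np
import pandas as pd

# Mixed text/number arrays are stored by NumPy as strings,
# so the numbers are written as text here.
data = np.array([['col1', 'col2'], ['4.8', '2.8'], ['7.0', '1.2']])

# First row -> column names, remaining rows -> values converted to float
df = pd.DataFrame(data[1:], columns=data[0]).astype(float)
print(df)
```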

Related

querying a multiindex pandas dataframe with slices

Assuming I have the following multiindex DF
import pandas as pd
import numpy as np
from random import randint
input_id = np.array(['12345'])
docType = np.array(['pre','pub','app','dw'])
docId = np.array(['34455667'])
sec_type = np.array(['bib','abs','cl','de'])
sec_ids = np.array(['x-y','z-k'])
index = pd.MultiIndex.from_product([input_id,docType,docId,sec_type,sec_ids])
content= [str(randint(1,10))+ '##' + str(randint(1,10)) for i in range(len(index))]
df = pd.DataFrame(content, index=index, columns=['content'])
df.rename_axis(index=['input_id','docType','docId','secType','sec_ids'], inplace=True)
df
I know that I can query a multiindex DF as follows:
# querying a multiindex DF
idx = pd.IndexSlice
df.loc[idx[:,['pub','pre'],:,'de',:]]
Basically, with the help of pd.IndexSlice I can pass the values I want for each index level. In the above case I want the resulting DF where the second level is 'pub' OR 'pre' and the fourth one is 'de'.
I am looking for a way to pass a range of values to the query, something like index level 3 being between 34567 and 45657. Assume those are integers.
pseudocode: df.loc[idx[:,['pub','pre'],XXXXX,'de',:]]
XXXX = ?
EDIT 1:
The docId index level is of text type, so it is probably necessary to convert it to int first.
Turns out query is very powerful:
df.query('docType in ["pub","pre"] and ("34455667" <= docId <= "3445568") and (secType=="de")')
Output:
                                           content
input_id docType docId    secType sec_ids
12345    pre     34455667 de      x-y         2##9
                                  z-k         6##1
         pub     34455667 de      x-y         6##5
                                  z-k         9##8
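The query above compares docId values as strings, so the range check is lexicographic rather than numeric. For a true numeric range, one option is to convert the level values to integers and build a boolean mask. The sketch below assumes every docId is a string of digits; the bounds `lo` and `hi` and the placeholder `content` values are made up for illustration.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['12345'], ['pre', 'pub', 'app', 'dw'], ['34455667'],
     ['bib', 'abs', 'cl', 'de'], ['x-y', 'z-k']],
    names=['input_id', 'docType', 'docId', 'secType', 'sec_ids'])
df = pd.DataFrame({'content': np.arange(len(index))}, index=index)

# Pull out the level values; docId is converted from text to int
doc_ids = df.index.get_level_values('docId').astype(int)
doc_types = df.index.get_level_values('docType')
sec_types = df.index.get_level_values('secType')

lo, hi = 34455000, 34456000  # hypothetical numeric bounds
mask = ((doc_ids >= lo) & (doc_ids <= hi)
        & doc_types.isin(['pub', 'pre'])
        & (sec_types == 'de'))
result = df[mask]
print(result)
```

The mask works the same whether or not the index is sorted, and it leaves the original index untouched.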

Implement df.groupby('user')['item'].apply(np.array) in cuDF

Is there any way to replicate this simple pandas functionality to cuDF?
Note that array lengths are varying.
An example of the expected output using pandas and NumPy (CuPy in the cuDF case) can be found below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'user':[0,1,0,2,1], 'item':[1,2,3,4,5]})
res = df.groupby('user')['item'].apply(np.array)
res
# Output:
# user
# 0 [1, 3]
# 1 [2, 5]
# 2 [4]
# Name: item, dtype: object

How to apply a style to a Python DataFrame for all rows except the last one?

I am trying to apply a Bar Style to all of the data in the dataframe, except the last row, which is supposed to be the Total row.
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
data.loc['Total'] = data.sum()
              A         B
0     -1.224620 -0.373898
1      0.75568   0.997875
2     -1.284663 -0.211903
3     -0.274813 -0.871816
4      1.256267 -0.742521
Total -0.772143 -1.202263
The docs explain that
A tuple is treated as (row_indexer, column_indexer)
so you just need to adjust the subset option to list the row labels you want styled.
On your data:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
data.loc['Total'] = data.sum()
data.style.bar(subset = ([0,1,2,3,4], ['A', 'B']))
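Hard-coding the row labels breaks as soon as the number of rows changes. A small sketch of the same idea that derives the labels from the index instead, assuming the Total row is always the last one; `bar_except_last` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
data.loc['Total'] = data.sum()

# Every row label except the last one, paired with all columns
body_rows = data.index[:-1]
subset = pd.IndexSlice[body_rows, :]

def bar_except_last(df):
    """Hypothetical helper: apply the bar style to every row but the last."""
    return df.style.bar(subset=pd.IndexSlice[df.index[:-1], :])
```

In a notebook, `bar_except_last(data)` would render the styled frame, and `data.loc[subset]` shows which cells the style covers.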

how can select data of coefficient of 3 columns from csv file

I would like to plot a subset of columns for two different scenarios based on the index of rows in my dataset, preferably via pandas.DataFrame:
1st scenario: columns index[2,5,8,..., n+2]
2nd scenario: the last 480 columns or column index [961-1439]
I've tried to work with the column index as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dft = pd.read_csv(r"D:\Test.csv", header=None)
dft.head()
id_set = dft[dft.index % 2 == 0].astype('int').values
A = dft[dft.index % 2 == 1].values
B = dft[dft.index % 2 == 2].values
C = dft[dft.index % 2 == 3].values
data = {'A': A[:,0], 'B': B[:,0], 'C': C[:,0]}
df = pd.DataFrame(data, columns=['A','B','C'], index = id_set[:,0])
#1st scenario
j=0
index=[]
for i in range(1439):
    if j==2:
        j=0
        continue
    else:
        index.append(i)
        j+=1
print(index)
#2nd scenario
last_480 = df[0:480][::-1]
I've found this post1 and post2 but they weren't my case!
I would appreciate if someone can help me.
1st scenario:
df.iloc[:, 2::3]
The slicing here means all rows, and columns starting at index 2 (the third column), then every third column after that.
2nd scenario:
df.iloc[:, :961:-1]
The slicing here means all rows, and columns taken from the last one backward, stopping before index 961, so they come out in reverse order.
EDIT:
import matplotlib.pyplot as plt
import seaborn as sns
scenario1 = df.iloc[:, 2::3].copy()
sns.lineplot(data = scenario1.T)
You can save a copy of the slice to another variable; then, since you want to graph row-wise, take the transpose of the sliced frame (this turns your rows into columns).
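To check what each slice actually selects, here is a small sketch on a stand-in frame. It assumes the data has 1440 columns (inferred from the highest column index of 1439 mentioned in the question), and also shows `-480:` as a way to take the last 480 columns in their original order.

```python
import numpy as np
import pandas as pd

# Stand-in frame: 2 rows, 1440 columns labeled 0..1439 (assumed width)
df = pd.DataFrame(np.arange(2 * 1440).reshape(2, 1440))

every_third = df.iloc[:, 2::3]        # columns 2, 5, 8, ...
last_480 = df.iloc[:, -480:]          # last 480 columns, original order
reversed_tail = df.iloc[:, :961:-1]   # from the end back to 962, reversed

print(list(every_third.columns[:3]))
print(last_480.columns[0], last_480.columns[-1])
print(reversed_tail.columns[0], reversed_tail.columns[-1])
```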

How do I efficiently change a pd.Series of lists in a dataframe to a pd.Series of np.arrays

I have a PostgreSQL database that has data similar to:
date, character varying, character varying, integer[]
The integer array column stores a list of values: 1,2,3,4,5
I'm using pd.read_sql to read the data into a dataframe.
So I have a dataframe with a date column, several string columns, and then a column with a list of integers.
The array values are regularly used in numpy arrays to do vector math.
In the past I couldn't find a way to convert the list column to a numpy array column without looping and recreating the dataframe row by row.
As an example:
import pandas as pd
import numpy as np
col1 = ['String data'] * 4
col2 = [[1,2,3,4,5]] * 4
d = {'Description': col1, 'Measures':col2}
df = pd.DataFrame(d)
new_df = pd.DataFrame(columns=df.columns)
for i in range(len(df)):
    new_df.loc[i, ['Description','Measures']] = [df.at[i,'Description'], np.array(df.at[i,'Measures'])]
print(new_df)
This looping could be over a few thousand rows.
More recently I figured out that I could do a single-line conversion of Series -> list -> nparray -> list -> Series and achieve the result a lot more efficiently.
import pandas as pd
import numpy as np
col1 = ['String data'] * 4
col2 = [[1,2,3,4,5]] * 4
d = {'Description': col1, 'Measures':col2}
df = pd.DataFrame(d)
df['NParray'] = pd.Series(list(np.array(list(np.array(df['Measures'])))))
df.drop(['Measures'], axis=1, inplace=True)
print(df)
print(type(df['NParray'][0]))
I read about and tried to use Series.array and Series.to_numpy, but they don't really achieve what I'm trying to do.
So, the question is:
Is there a method to convert a pd.Series of lists to a numpy array as I'm trying to do?
Is there any easier way to mass convert these lists to numpy arrays?
I was hoping for something simple like:
df['NParray'] =np.asarray(df['Measures'])
df['NParray'] =np.array(df['Measures'])
df['NParray'] =df['Measures'].array
df['NParray'] =df['Measures'].to_numpy()
But these have different functions and do not work for my purpose.
------------Edited after testing------------------------------------------------
I set up a small test to see what the difference in timings and efficiency would be:
import pandas as pd
import numpy as np
from datetime import datetime  # pd.datetime is no longer available in recent pandas
def get_dataframe():
    col1 = ['String data'] * 10000
    col2 = [list(range(0,5000))] * 10000
    d = {'Description': col1, 'Measures':col2}
    df = pd.DataFrame(d)
    return df
def old_looping(df):
    new_df = pd.DataFrame(columns=df.columns)
    starttime = datetime.now()
    for i in range(len(df)):
        new_df.loc[i, ['Description','Measures']] = [df.at[i,'Description'], np.array(df.at[i,'Measures'])]
    endtime = datetime.now()
    duration = endtime - starttime
    print('Looping', duration)
def series_transforms(df):
    starttime = datetime.now()
    df['NParray'] = pd.Series(list(np.array(list(np.array(df['Measures'])))))
    df.drop(['Measures'], axis=1, inplace=True)
    endtime = datetime.now()
    duration = endtime - starttime
    print('Transforms', duration)
def use_apply(df):
    starttime = datetime.now()
    df['Measures'] = df['Measures'].apply(np.array)
    endtime = datetime.now()
    duration = endtime - starttime
    print('Apply', duration)
def run_test(tests):
    for i in range(tests):
        construct_df = get_dataframe()
        old_looping(construct_df)
    for i in range(tests):
        construct_df = get_dataframe()
        series_transforms(construct_df)
    for i in range(tests):
        construct_df = get_dataframe()
        use_apply(construct_df)
run_test(5)
With 10,000 rows the results were:
Transforms 3.945816
Transforms 3.968821
Transforms 3.891866
Transforms 3.859437
Transforms 3.860590
Apply 4.218867
Apply 4.015742
Apply 4.046986
Apply 3.906360
Apply 3.890740
Looping 27.662418
Looping 27.814523
Looping 27.298895
Looping 27.565626
Looping 27.222970
Transforming through Series-List-NP Array-List-Series is negligibly faster than using Apply. Apply is definitely shorter code and possibly easier to understand.
Increasing the number of rows or array length will increase the times by the same magnitude.
Easiest might be to go with apply to convert to the np.array: df['Measures'].apply(np.array)
Full example:
import pandas as pd
import numpy as np
col1 = ['String data'] * 4
col2 = [[1,2,3,4,5]] * 4
d = {'Description': col1, 'Measures':col2}
df = pd.DataFrame(d)
print(df.Measures)
df['NParray'] = df['Measures'].apply(np.array)
df.drop(['Measures'], axis=1, inplace=True)
print(df)
print(type(df['NParray'][0]))
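If the goal is vector math across rows, another option is to stack the lists into a single 2-D array rather than keeping one small array per cell. This is only a sketch and it assumes every list in the column has the same length; the names `matrix` and `column_totals` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Description': ['String data'] * 4,
                   'Measures': [[1, 2, 3, 4, 5]] * 4})

# Stack the equal-length lists into one 2-D array (rows x values)
matrix = np.array(df['Measures'].tolist())
column_totals = matrix.sum(axis=0)  # vector math runs on the whole block

# If a per-row array column is still wanted, split the matrix back into rows
df['NParray'] = pd.Series(list(matrix), index=df.index)
print(matrix.shape)
print(type(df['NParray'].iloc[0]))
```

Keeping the 2-D array alongside the DataFrame avoids object-dtype columns for the numeric work, at the cost of tracking the two structures together.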
