Merge multiple int columns/rows into one numpy array (pandas dataframe) - python

I have a pandas DataFrame with a few columns and rows. I want to merge the columns into one, and then merge the rows into one per group, grouping by id and date.
Currently I am doing so by:
cols = [f'col{i}' for i in range(1, 49)]  # 'col1' .. 'col48'
df['matrix'] = df[cols].values.tolist()
df = df.groupby(['id','date'])['matrix'].apply(list).reset_index(name='matrix')
This gives me the matrix in the form of a list.
Later I convert it into a numpy.ndarray using:
df['matrix'] = df['matrix'].apply(np.array)
This is a small segment of my dataset for reference:
id,date,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19,col20,col21,col22,col23,col24,col25,col26,col27,col28,col29,col30,col31,col32,col33,col34,col35,col36,col37,col38,col39,col40,col41,col42,col43,col44,col45,col46,col47,col48
16,2014-06-22,0,0,0,10,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16,2014-06-22,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16,2014-06-22,2,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16,2014-06-22,3,0,0,0,0,0,0,0,0,0,0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0
16,2014-06-22,4,0,0,0,0,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,22,0,0,0,0
The code above works fine for small datasets but sometimes crashes on larger ones, specifically at the df['matrix'].apply(np.array) statement.
Is there a way to do the merging so that I get a numpy.ndarray directly? This would save a lot of time.

There is no need to merge the columns first. Split the DataFrame with groupby and then flatten each group's values:
matrix=df.set_index(['id','date']).groupby(['id','date']).apply(lambda x: x.values.flatten())
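If a 2-D array per (id, date) group is what you are after (one row per original row), a minimal sketch along the same lines is below. The tiny frame and the names value_cols and matrices are made up for illustration; only the id and date columns come from the question.

```python
import pandas as pd

# Hypothetical miniature of the question's data: two value columns instead of 48.
df = pd.DataFrame({
    "id": [16, 16, 17],
    "date": ["2014-06-22", "2014-06-22", "2014-06-23"],
    "col1": [0, 1, 2],
    "col2": [10, 0, 5],
})

# Every column except the grouping keys is treated as matrix data.
value_cols = [c for c in df.columns if c not in ("id", "date")]

# One 2-D NumPy array per (id, date) pair, built without going through lists.
matrices = {
    key: group[value_cols].to_numpy()
    for key, group in df.groupby(["id", "date"])
}

print(matrices[(16, "2014-06-22")].shape)
```

Calling .ravel() on any of these arrays gives the flattened 1-D form that the one-liner above produces.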

Related

How to split dataframe or array by unique column value with multiple unique values

So I have a dataframe that looks like this for example:
In this example, I need to split the dataframe into multiple dataframes based on the account_id(or arrays because I will convert it anyways). I want each account id (ab123982173 and bc123982173) to be either an individual data frame or array. Since the actual dataset is thousands of rows long, splitting into a temporary array in a loop was my original thought.
Any help would be appreciated.
You can get a subset of your dataframe with a boolean filter.
Using your dataframe as an example:
subset_dataframe = dataframe[dataframe["Account_ID"] == "ab123982173"]
Here is a link from the pandas documentation that has visual examples:
https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html
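To get every account at once rather than filtering one ID at a time, groupby can do the split in a single pass. This is only a sketch: the amount column and its values are invented, and the Account_ID column name follows the snippet above.

```python
import pandas as pd

# Hypothetical data; only the account IDs come from the question.
dataframe = pd.DataFrame({
    "Account_ID": ["ab123982173", "bc123982173", "ab123982173"],
    "amount": [10, 20, 30],
})

# Iterating over a groupby yields (key, sub-dataframe) pairs.
frames = {account: group for account, group in dataframe.groupby("Account_ID")}

# The same split, converted to NumPy arrays.
arrays = {account: group.to_numpy() for account, group in frames.items()}

print(frames["ab123982173"])
```

The dictionary keys are the unique account IDs, so no explicit loop over a list of IDs is needed.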

python pandas difference between df_train["x"] and df_train[["x"]]

I have the following dataset, which I read from a CSV file.
x =[1,2,3,4,5]
With pandas I can access the column in two ways:
df_train = pd.read_csv("train.csv")
x = df_train["x"]
And
x = df_train[["x"]]
Both seem to produce the same result. The former makes sense to me, but the latter does not. Could you please explain the difference and when to use each?
In pandas, you can slice your data frame in different ways. On a high level, you can choose to select a single column out of a data frame, or many columns.
When you select many columns, you have to slice using a list, and the return is a pandas DataFrame. For example
df[['col1', 'col2', 'col3']] # returns a data frame
When you select only one column, you can pass only the column name, and the return is just a pandas Series
df['col1'] # returns a series
When you do df[['col1']], you get back a DataFrame with only one column. In other words, it's like you're telling pandas "give me all the columns from the following list" and handing it a list with a single column name in it. It will filter your df and return all the columns in your list (in this case, a data frame with only one column).
If you want more details on the difference between a Series and a one-column DataFrame, check this thread with very good answers
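A quick way to see the difference is to compare the types and shapes of the two results. The frame below is a stand-in for train.csv, and the y column is invented for illustration.

```python
import pandas as pd

# Stand-in for the CSV from the question; the "y" column is hypothetical.
df_train = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]})

series = df_train["x"]    # single label      -> pandas Series (1-D)
frame = df_train[["x"]]   # list of one label -> pandas DataFrame (2-D)

print(type(series).__name__, series.shape)
print(type(frame).__name__, frame.shape)
```

The values are the same, but the Series is one-dimensional while the one-column DataFrame keeps a second (column) axis, which matters for APIs that expect 2-D input.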

Merging data from python list into one dataframe

I have the files AAMC_K.txt, AAU.txt, ACU.txt, and ACY.txt in a folder called AMEX. I am trying to merge these text files into one dataframe. I tried pd.merge(), but I get an error saying that the merge function needs left and right parameters, whereas my data is in a Python list. How can I merge the data in data_list into one pandas dataframe?
import pandas as pd
import os
textfile_names = os.listdir("AMEX")
textfile_names.sort()
data_list = []
for i in range(len(textfile_names)):
    data = pd.read_csv("AMEX/"+textfile_names[i], index_col=None, header=0)
    data_list.append(data)
frame = pd.merge(data_list, on='<DTYYYYMMDD>', how='outer')
AE.txt
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
AE,D,19970102,000000,12.6250,12.6250,11.7500,11.7500,144,0
AE,D,19970103,000000,11.8750,12.1250,11.8750,12.1250,25,0
AAU.txt
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
AAU,D,20020513,000000,0.4220,0.4220,0.4220,0.4220,0,0
AAU,D,20020514,000000,0.4177,0.4177,0.4177,0.4177,0,0
ACU.txt
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
ACU,D,19970102,000000,5.2500,5.3750,5.1250,5.1250,52,0
ACU,D,19970103,000000,5.1250,5.2500,5.0625,5.2500,12,0
ACY.txt
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
ACY,D,19980116,000000,9.7500,9.7500,8.8125,8.8125,289,0
ACY,D,19980120,000000,8.7500,8.7500,8.1250,8.1250,151,0
I want the output to be matched on the <DTYYYYMMDD> column and put into one dataframe, frame.
OUTPUT
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>,<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
ACU,D,19970102,000000,5.2500,5.3750,5.1250,5.1250,52,0,AE,D,19970102,000000,12.6250,12.6250,11.7500,11.7500,144,0
ACU,D,19970103,000000,5.1250,5.2500,5.0625,5.2500,12,0,AE,D,19970103,000000,11.8750,12.1250,11.8750,12.1250,25,0
As @busybear says, pd.concat is the right tool for this job: frame = pd.concat(data_list).
merge is for when you're joining two dataframes which usually have some of the same columns and some different ones. You choose a column (or index or multiple) which identifies which rows in the two dataframes correspond to each other, and pandas handles making a dataframe whose rows are combinations of the corresponding rows in the two original dataframes. This function only works on 2 dataframes at a time; you'd have to do a loop to merge more in (it's uncommon to need to merge many dataframes this way).
concat is for when you have multiple dataframes and want to just append all of their rows or columns into one large dataframe. (Let's assume you're concatenating rows, as you want here.) It doesn't use an identifier to determine which rows correspond. All it does is create a new dataframe which has each row from each of the concated dataframes (all the rows from the first, then all from the second, etc.).
I think the above is a decent TLDR on merge vs concat but see here for a lengthy but much more comprehensive guide on using merge/join/concat with dataframes.
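To make the contrast concrete, here is a small sketch using two cut-down frames modelled on the AE and ACU samples from the question. Only three of the ten columns are kept, and the suffix names are arbitrary choices for illustration.

```python
import pandas as pd

# Trimmed versions of the AE.txt and ACU.txt samples (three columns only).
ae = pd.DataFrame({
    "<TICKER>": ["AE", "AE"],
    "<DTYYYYMMDD>": [19970102, 19970103],
    "<CLOSE>": [11.75, 12.125],
})
acu = pd.DataFrame({
    "<TICKER>": ["ACU", "ACU"],
    "<DTYYYYMMDD>": [19970102, 19970103],
    "<CLOSE>": [5.125, 5.25],
})
data_list = [ae, acu]

# concat: stack the rows of every frame in the list, no key involved.
stacked = pd.concat(data_list, ignore_index=True)

# merge: pair up rows of exactly two frames that share the same date.
side_by_side = pd.merge(
    acu, ae, on="<DTYYYYMMDD>", how="outer", suffixes=("_ACU", "_AE")
)

print(stacked.shape, side_by_side.shape)
```

concat accepts the whole list and returns four rows with the original three columns, while merge takes two frames and returns one row per date with the columns of both side by side.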

Pandas: after slicing along specific columns, get "values" without returning entire dataframe

Here is what is happening:
df = pd.read_csv('data')
important_region = df[df.columns.get_loc('A'):df.columns.get_loc('C')]
important_region_arr = important_region.values
print(important_region_arr)
Now, here is the issue:
print(important_region.shape)
output: (5,30)
print(important_region_arr.shape)
output: (5,30)
print(important_region)
output: my columns, in the panda way
print(important_region_arr)
output: first 5 rows of the dataframe
How, having indexed my columns, do I transition to the numpy array?
Alternatively, I could just convert to numpy from the get-go and run the slicing operation within numpy. But, how is this done in pandas?
Here is how you can slice the dataset by specific columns. loc gives you access to a group of rows and columns by label: what comes before the comma selects rows, and what comes after it selects columns. A bare : means all the rows. Calling .values on the sliced frame then returns only those columns as a NumPy array.
data.loc[:,'A':'C']
For more understanding, please look at the documentation.
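A minimal sketch of both routes, label-based and position-based, on a made-up four-column frame (the column names A to D and the numbers are placeholders):

```python
import pandas as pd

# Hypothetical stand-in for the CSV in the question.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6], "D": [7, 8]})

# Label-based column slice; with loc, both endpoints are included.
by_label = df.loc[:, "A":"C"].to_numpy()

# Position-based equivalent; iloc excludes the stop position, hence the + 1.
start = df.columns.get_loc("A")
stop = df.columns.get_loc("C")
by_position = df.iloc[:, start:stop + 1].to_numpy()

print(by_label.shape)
```

Note that df[start:stop] without loc or iloc slices rows, not columns, which is why the original attempt returned the first rows of the whole dataframe.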

pandas: Select one-row data frame instead of series [duplicate]

I have a huge dataframe, and I index it like so:
df.ix[<integer>]
Depending on the index, sometimes this will have only one row of values. Pandas automatically converts this to a Series, which, quite frankly, is annoying because I can't operate on it the same way I can a df.
How do I either:
1) Stop pandas from converting and keep it as a dataframe ?
OR
2) easily convert the resulting series back to a dataframe ?
pd.DataFrame(df.ix[<integer>]) does not work because it doesn't keep the original columns. It treats the <integer> as the column, and the columns as indices. Much appreciated.
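You can do df.ix[[n]] to get a one-row dataframe of row n. In pandas 1.0 and later, where .ix has been removed, df.loc[[n]] (by label) or df.iloc[[n]] (by position) does the same.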
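For completeness, here is a short sketch covering both parts of the question with iloc instead of the removed .ix. The small frame is invented for illustration.

```python
import pandas as pd

# Hypothetical frame; any dataframe behaves the same way.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

row_series = df.iloc[1]    # scalar indexer       -> Series
row_frame = df.iloc[[1]]   # list with one item   -> one-row DataFrame

# Option 2 from the question: turn an existing Series back into a one-row frame.
back = row_series.to_frame().T

print(row_frame)
```

to_frame() on its own produces a single column with the original column names as the index, which is the behaviour described in the question; the transpose puts the columns back where they belong.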
