python pandas difference between df_train["x"] and df_train[["x"]] - python

I have the following dataset and reading it from csv file.
x =[1,2,3,4,5]
with the pandas i can access the array
df_train = pd.read_csv("train.csv")
x = df_train["x"]
And
x = df_train[["x"]]
I could wonder since both producing the same result the former one could make sense but later one not. PLEASE, COULD YOU explain the difference and use?

In pandas, you can slice your data frame in different ways. On a high level, you can choose to select a single column out of a data frame, or many columns.
When you select many columns, you have to slice using a list, and the return is a pandas DataFrame. For example
df[['col1', 'col2', 'col3']] # returns a data frame
When you select only one column, you can pass only the column name, and the return is just a pandas Series
df['col1'] # returns a series
When you do df[['col1']], you return a DataFrame with only one column. In other words, it's like your telling pandas "give me all the columns from the following list:" and just give it a list with one column on it. It will filter your df, returning all columns in your list (in this case, a data frame with only 1 column)
If you want more details on the difference between a Series and a one-column DataFrame, check this thread with very good answers

Related

pandas max function results in inoperable DataFrame

I have a DataFrame with four columns and want to generate a new DataFrame with only one column containing the maximum value of each row.
Using df2 = df1.max(axis=1) gave me the correct results, but the column is titled 0 and is not operable. Meaning I can not check it's data type or change it's name, which is critical for further processing. Does anyone know what is going on here? Or better yet, has a better way to generate this new DataFrame?
It is Series, for one column DataFrame use Series.to_frame:
df2 = df1.max(axis=1).to_frame('maximum')

How to find if a values exists in all rows of a dataframe?

I have an array of unique elements and a dataframe.
I want to find out if the elements in the array exist in all the row of the dataframe.
p.s- I am new to python.
This is the piece of code I've written.
for i in uniqueArray:
for index,row in newDF.iterrows():
if i in row['MKT']:
#do something to find out if the element i exists in all rows
Also, this way of iterating is quite expensive, is there any better way to do the same?
Thanks in Advance.
Pandas allow you to filter a whole column like if it was Excel:
import pandas
df = pandas.Dataframe(tableData)
Imagine your columns names are "Column1", "Column2"... etc
df2 = df[ df["Column1"] == "ValueToFind"]
df2 now has only the rows that has "ValueToFind" in df["Column1"]. You can concatenate several filters and use AND OR logical doors.
You can try
for i in uniqueArray:
if newDF['MKT'].contains(i).any():
# do your task
You can use isin() method of pd.Series object.
Assuming you have a data frame named df and you check if your column 'MKT' includes any items of your uniqueArray.
new_df = df[df.MKT.isin(uniqueArray)].copy()
new_df will only contain the rows where values of MKT is contained in unique Array.
Now do your things on new_df, and join/merge/concat to the former df as you wish.

Merge multiple int columns/rows into one numpy array (pandas dataframe)

I have a pandas dataframe with few columns and rows. I want to merge the columns into one and then merge the rows based on id and date into one.
Currently I am doing so by:
df['matrix'] = df[[col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19,col20,col21,col22,col23,col24,col25,col26,col27,col28,col29,col30,col31,col32,col33,col34,col35,col36,col37,col38,col39,col40,col41,col42,col43,col44,col45,col46,col47,col48]].values.tolist()
df = df.groupby(['id','date'])['matrix'].apply(list).reset_index(name='matrix')
This gives me the matrix in form of a list.
Later I convert it into numpy.ndarray using:
df['matrix'] = df['matrix'].apply(np.array)
This is a small segment of my dataset for reference:
id,date,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19,col20,col21,col22,col23,col24,col25,col26,col27,col28,col29,col30,col31,col32,col33,col34,col35,col36,col37,col38,col39,col40,col41,col42,col43,col44,col45,col46,col47,col48
16,2014-06-22,0,0,0,10,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16,2014-06-22,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16,2014-06-22,2,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16,2014-06-22,3,0,0,0,0,0,0,0,0,0,0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0
16,2014-06-22,4,0,0,0,0,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,22,0,0,0,0
Though the above piece of code works fine for small datasets, but sometimes crashes for larger ones. Specifically df['matrix'].apply(np.array) statement.
Is there a way by which I can perform the merging to fetch me a numpy.array? This would save a lot of time.
No need to merge the columns at first. Split DataFrame using groupby and then flatten the result
matrix=df.set_index(['id','date']).groupby(['id','date']).apply(lambda x: x.values.flatten())

I have pandas dataframe which i would like to be sliced after every 4 columns

I have pandas dataframe which i would like to be sliced after every 4 columns and then vertically stacked on top of each other which includes the date as index.Is this possible by using np.vstack()? Thanks in advance!
ORIGINAL DATAFRAME
Please refer the image for the dataframe.
I want something like this
WANT IT MODIFIED TO THIS
Until you provide a Minimal, Complete, and Verifiable example, I will not test this answer but the following should work:
given that we have the data stored in a Pandas DataFrame called df, we can use pd.melt
moltendfs = []
for i in range(4):
moltendfs.append(df.iloc[:, i::4].reset_index().melt(id_vars='date'))
newdf = pd.concat(moltendfs, axis=1)
We use iloc to take only every fourth column, starting with the i-th column. Then we reset_index in order to be able to keep the date column as our identifier variable. We use melt in order to melt our DataFrame. Finally we simply concatenate all of these molten DataFrames together side by side.

pandas: Select one-row data frame instead of series [duplicate]

I have a huge dataframe, and I index it like so:
df.ix[<integer>]
Depending on the index, sometimes this will have only one row of values. Pandas automatically converts this to a Series, which, quite frankly, is annoying because I can't operate on it the same way I can a df.
How do I either:
1) Stop pandas from converting and keep it as a dataframe ?
OR
2) easily convert the resulting series back to a dataframe ?
pd.DataFrame(df.ix[<integer>]) does not work because it doesn't keep the original columns. It treats the <integer> as the column, and the columns as indices. Much appreciated.
You can do df.ix[[n]] to get a one-row dataframe of row n.

Categories

Resources