Suppose I had data with 12 columns; the following would get me all 12 of them.
train_data = np.asarray(pd.read_csv(StringIO(train_data), sep=',', header=None))
inputs = train_data[:, :12]
However, let's say I want a subset of these columns (not all of them).
If I had a list
a=[1,5,7,10]
is there a smart way to pass "a" so that I get a new dataframe whose columns reflect the entries of "a"? That is, the first column of the new dataframe would be the first column of the big dataframe, the next column would be the 5th column of the big dataframe, and so on.
Thank you.
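For what it's worth, a minimal sketch of one way to do this, assuming a NumPy array like the one above (the data here is made up for illustration):

import numpy as np

# hypothetical 12-column array standing in for train_data
train_data = np.arange(36).reshape(3, 12)

a = [1, 5, 7, 10]
inputs = train_data[:, a]  # fancy indexing picks exactly those columns, in order

If you stay in pandas instead of converting with np.asarray, df.iloc[:, a] should do the same positional selection.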
I have a pandas pivot table, read from a csv file, that I want to normalize. Every column has to be divided by the values in the 9:30 column so that the data scales to "1".
Here is what I tried:
columns = table.columns
for col in table[columns]:
    table[col] = table[col] / table[columns[0]]
"table" is the pivot table dataframe.
I put the column names into "columns" because they are in datetime format.
The issue is that once the first column is divided by itself, all of its values turn to 1 (which is what I want), but all the other columns stay the same. I can only assume this is because the rest of the columns are then being divided by that column of 1s, not by the original values.
So what I'm thinking is to isolate the 9:30 column and create a separate copy of it to use for the division, so that the original dataframe isn't changed mid-loop.
You can save the data with .copy():
divisor = table.iloc[:, 0].copy()
for col in columns:
    table[col] /= divisor
However, in your case you can just use .div():
table = table.div(table.iloc[:, 0], axis=0)
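As a quick illustration, with a made-up two-column table where the first column plays the role of the 9:30 column:

import pandas as pd

# hypothetical pivot table
table = pd.DataFrame({"09:30": [2.0, 4.0], "10:00": [4.0, 8.0]})

table = table.div(table.iloc[:, 0], axis=0)
# the 09:30 column becomes all 1s; every other column is scaled row-wise by the same divisor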
I have a dataframe with columns that are mostly blank (null/NaN set to 0) with sporadic number values.
I am trying to compare the last two non-zero values in a dataframe column.
Something like :
df['Column_c'] = df['column_a'].last_non_zero_value > df['column_a'].second_to_last_non_zero_value
This is what the columns look like in Excel:
You could drop all the rows with missing data using df.dropna(), then access the last rows of the dataframe and return the values as an array, from which it is easy to pick the last two elements.
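A minimal sketch of that idea, using a boolean filter rather than dropna (since the blanks here were set to 0, not NaN); the data is made up:

import pandas as pd

# hypothetical column where 0 stands in for missing values
df = pd.DataFrame({'column_a': [0, 3, 0, 5, 0, 2, 0]})

non_zero = df.loc[df['column_a'] != 0, 'column_a']
df['Column_c'] = non_zero.iloc[-1] > non_zero.iloc[-2]  # compares the last two non-zero values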
If I have two dataframes: dataframe A has an a_id column, and dataframe B has a b_id column and a b_value column. How can I join A and B on a_id = b_id and get C with id and max(b_value)?
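For the record, a minimal sketch of the join-plus-aggregate itself (column names taken from the question, data made up):

import pandas as pd

# hypothetical frames
a = pd.DataFrame({'a_id': [1, 2]})
b = pd.DataFrame({'b_id': [1, 1, 2], 'b_value': [10, 30, 20]})

c = (a.merge(b, left_on='a_id', right_on='b_id')
       .groupby('a_id', as_index=False)['b_value'].max()
       .rename(columns={'a_id': 'id', 'b_value': 'max_b_value'}))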
You can use the concat function in pandas to append either columns or rows from one DataFrame to another.
Here's an example:
# Read in first 10 lines of surveys table
survey_sub = surveys_df.head(10)
# Grab the last 10 rows
survey_sub_last10 = surveys_df.tail(10)
# Reset the index values so the second dataframe appends properly
survey_sub_last10 = survey_sub_last10.reset_index(drop=True)
# drop=True option avoids adding new index column with old index values
When I concatenate DataFrames, I need to specify the axis. axis=0 tells pandas to stack the second DataFrame UNDER the first one. It will automatically detect whether the column names are the same and will stack accordingly. axis=1 will stack the columns in the second DataFrame to the RIGHT of the first DataFrame. To stack the data vertically, I need to make sure both datasets have the same columns and associated column formats. When I stack horizontally, I want to make sure that what I am doing makes sense (i.e. that the data are related in some way).
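Continuing the snippet above, the actual concatenation might look like this, using the survey_sub frames already built:

import pandas as pd

# Stack the DataFrames on top of each other (rows)
vertical_stack = pd.concat([survey_sub, survey_sub_last10], axis=0)

# Place the DataFrames side by side (columns)
horizontal_stack = pd.concat([survey_sub, survey_sub_last10], axis=1)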
First, I haven't found this asked before - probably because I'm not using the right words to ask it. So if it has been asked, please send me in that direction.
How can I combine two pandas data frames based on column AND row? My main dataframe has a column 'years' and a column 'county', among others. Ideally, I want to add another column, 'percent', from the second data frame below.
For example, I have this image of my first df:
and I have another data frame with the same 'year' column and every other column name is a string value in the original "main" dataframe's 'county' column:
How can I combine these two data frames in a way that adds another column to the 'main df'? It would be helpful to first put the second data frame in the format where there are three columns: 'year', 'county', and 'percent'. If anyone can help me with this part, I can merge it.
I think what you will want to do is transform the second dataframe to have a row for each year/county combination, and then you can use a left join to combine the two. I believe the melt method will do this transformation. Try this:
melted_second_df = second_df.melt(id_vars=["year"], var_name="county", value_name="percent")
combined_df = first_df.merge(
    right=melted_second_df,
    on=["year", "county"],
    how="left"
)
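To make the melt step concrete, here is a small made-up example of the wide-to-long transformation (the county names are hypothetical):

import pandas as pd

# hypothetical wide frame: one row per year, one column per county
second_df = pd.DataFrame({
    'year': [2000, 2001],
    'Adams': [0.1, 0.2],
    'Brown': [0.3, 0.4],
})

melted_second_df = second_df.melt(id_vars=["year"], var_name="county", value_name="percent")
# melted_second_df now has three columns (year, county, percent), one row per year/county pair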
I have the following dataset, which I am reading from a csv file.
x = [1, 2, 3, 4, 5]
With pandas I can access the array:
df_train = pd.read_csv("train.csv")
x = df_train["x"]
And
x = df_train[["x"]]
Since both produce the same result, I wonder why the former makes sense to me but the latter does not. Could you please explain the difference and when to use each?
In pandas, you can slice your data frame in different ways. On a high level, you can choose to select a single column out of a data frame, or many columns.
When you select many columns, you have to slice using a list, and the return is a pandas DataFrame. For example
df[['col1', 'col2', 'col3']] # returns a data frame
When you select only one column, you can pass only the column name, and the return is just a pandas Series
df['col1'] # returns a series
When you do df[['col1']], you get back a DataFrame with only one column. In other words, it's like you're telling pandas "give me all the columns from the following list" and handing it a list with one column in it. It will filter your df, returning all the columns in your list (in this case, a data frame with only one column).
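A quick way to see the difference for yourself, with a small made-up frame:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3]})

print(type(df['col1']))    # <class 'pandas.core.series.Series'>
print(type(df[['col1']]))  # <class 'pandas.core.frame.DataFrame'>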
If you want more details on the difference between a Series and a one-column DataFrame, check this thread with very good answers