I have implemented the following piece of code:
array = [table.iloc[:, [0]], table.iloc[:, [i]]]
It is supposed to be a dataframe consisting of two vectors extracted from a previously imported dataset. I use the parameter i because this code is part of a loop that uses a predefined function to analyze correlations between one fixed variable [0] and the rest of them; each iteration checks the correlation with a different variable [i].
Python treats this object as a list (or as a tuple when I change the brackets to round ones). I need this object to be a dataframe (the next step is to remove NaN values using .dropna, which is a DataFrame method).
How can I fix that issue?
If I have understood your question correctly, you want to build an extract from a larger dataframe containing only 2 columns, known by their index numbers. You can simply do:
sub = table.iloc[:, [0,i]]
It will keep all attributes (including index, column names and dtype) from the original table dataframe.
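To keep the loop shape from the question while getting a real DataFrame at each step, the iloc selection combines directly with .dropna(). A minimal sketch, using a made-up table whose first column is the fixed variable (all names and values here are invented for illustration):

```python
import pandas as pd

# Hypothetical table: column 0 holds the fixed variable, the rest are
# the variables to correlate against it.
table = pd.DataFrame({
    "fixed": [1.0, 2.0, 3.0, 4.0, None],
    "a":     [2.0, 4.0, 6.0, 8.0, 10.0],
    "b":     [5.0, 3.0, None, 1.0, 0.0],
})

correlations = {}
for i in range(1, table.shape[1]):
    sub = table.iloc[:, [0, i]].dropna()   # a real two-column DataFrame
    correlations[table.columns[i]] = sub.iloc[:, 0].corr(sub.iloc[:, 1])
```

Because sub is a DataFrame rather than a list of two frames, .dropna() removes any row where either of the two columns is NaN before the correlation is computed.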
What is your goal with the dataframe?
A dataframe is a common term in data analysis using pandas. Pandas was developed precisely to facilitate such analysis; with it, getting the data from a .csv file into a dataframe is as simple as:
import pandas as pd
df = pd.read_csv('my-data.csv')
df.info()
Or from a dict or array:
df = pd.DataFrame(my_dict_or_array)
Then you can select the columns you wish:
df.loc[:, ['COLUMN_1', 'COLUMN_2']]
Let us know if this is what you are looking for.
So I have a dataframe that looks like this, for example:
In this example, I need to split the dataframe into multiple dataframes based on account_id (or arrays, because I will convert them anyway). I want each account id (ab123982173 and bc123982173) to be an individual dataframe or array. Since the actual dataset is thousands of rows long, splitting into a temporary array in a loop was my original thought.
Any help would be appreciated.
You can get a subset of your dataframe.
Using your dataframe as an example:
subset_dataframe = dataframe[dataframe["Account_ID"] == "ab123982173"]
Here is a link from the pandas documentation that has visual examples:
https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html
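If you need one sub-frame per account id rather than a single filter, a dict comprehension over groupby avoids writing the loop by hand. A sketch, where the column name and ids come from the question but the values are invented:

```python
import pandas as pd

# Invented values; the Account_ID column and ids follow the question.
df = pd.DataFrame({
    "Account_ID": ["ab123982173", "ab123982173", "bc123982173"],
    "value": [10, 20, 30],
})

# One sub-DataFrame per account id, keyed by the id itself.
frames = {account_id: group for account_id, group in df.groupby("Account_ID")}
```

Each value in frames is a regular DataFrame, so it can still be converted to an array afterwards with .to_numpy().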
I am new to using python with data sets and am trying to exclude a column ("id") from being shown in the output. Wondering how to go about this using the describe() and exclude functions.
describe works on datatypes. You can include or exclude based on the datatype, not based on columns. If your column id has a datatype unique to it, then:
df.describe(exclude=[datatype])
Or if you just want to drop the column(s) before calling describe, then try this:
cols = set(df.columns) - {'id'}
df1 = df[list(cols)]
df1.describe()
TaDa, it's done. For more info, see the pandas documentation on describe.
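As a concrete sketch of the dtype-based exclusion (the frame below is made up, with 'id' as the only object-dtype column):

```python
import pandas as pd

# Toy frame: 'id' is the only object (string) column.
df = pd.DataFrame({
    "id": ["a1", "a2", "a3"],
    "height": [1.2, 3.4, 5.6],
    "weight": [10, 20, 30],
})

# Excluding the object dtype drops 'id' from the summary.
summary = df.describe(exclude=["object"])
```

This only works as a way to hide 'id' if no other column you care about shares its dtype; otherwise those columns disappear from the summary too.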
You can do that by slicing your original DF to remove the 'id' column. One way is through .iloc. Let's suppose the column 'id' is the first column of your DF; then you could do this:
df.iloc[:, 1:].describe()
The first colon selects the rows, the second the columns.
Although somebody responded with an example from the official docs which is more than enough, I'd just like to add this, since it might help a few people:
If your DataFrame is large (say, hundreds of columns), removing one or two might not be enough; instead, create a smaller DataFrame holding only what you're interested in and go from there.
Example of removing 2+ columns:
columns_to_keep = set(your_bigger_data_frame.columns) - {'column_1', 'column_2', 'column_3', 'etc'}
your_new_smaller_data_frame = your_bigger_data_frame[list(columns_to_keep)]
your_new_smaller_data_frame.describe()
If your DataFrame is medium/small, you already know every column, and you only need a few of them, just create a new DataFrame and then apply describe().
I'll give an example that reads a .csv file and then builds a smaller DataFrame holding only what you need:
df = pd.read_csv('.\docs\project\file.csv')
df = df[['column_1', 'column_2', 'column_3', 'etc']]
df.describe()
Use output.drop(columns=['id']).describe() (describe's exclude parameter takes dtypes, not column names).
I'm new to Python and coding in general. I am attempting to automate the processing of some groundwater model output data in python. One pandas dataframe has measured stream flow with multiple columns of various types (left), the other has modeled stream flow (right). I've attempted to use pd.merge on column "Name" in order to link the correct modeled output value to the corresponding measured site value. When I use the following script I get the corresponding error:
left = measured_df
right = modeled_df
combined_df = pd.merge(left, right, on= 'Name')
ValueError: The column label 'Name' is not unique.
For a multi-index, the label must be a tuple with elements corresponding to each level.
The modeled data for each stream starts out as a numpy array (not sure about the dtype)
array(['silver_drn', '24.681524615195002'], dtype='<U18')
I then use np.concatenate to combine the 6 stream outputs into one array:
modeled = np.concatenate([[blitz_drn],[silvies_ss_drn],[silvies_drn],[bridge_drn],[krumbo_drn], [silver_drn]])
Then pd.DataFrame to create a pandas data frame with a column header:
modeled_df = pd.DataFrame(data=modeled, columns= [['Name','Modeled discharge (CFS)']])
See image links below to see how each dataframe looks (not sure the best way to share just yet).
Perhaps I'm misunderstanding how pd.merge works, or maybe the datatypes are different even if they appear to be text, but I figured that if each column were a string, it would append the modeled output to the corresponding row where the "Name" matches in each dataframe. Any help would be greatly appreciated.
When you do this:
modeled_df = pd.DataFrame(data=modeled,
columns= [['Name','Modeled discharge (CFS)']])
you create a MultiIndex on the columns, and merging a MultiIndexed DataFrame with a normally indexed one doesn't work as you might expect.
You should instead do:
modeled_df = pd.DataFrame(data=modeled,
columns=['Name','Modeled discharge (CFS)'])
# ^ ^
Then the merge should work as expected.
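A minimal sketch of the fixed flow, with invented values and the column names from the question (the modeled values stay as strings, as they would after np.concatenate on a '<U18' array):

```python
import pandas as pd

# Invented measured values; column names follow the question.
measured_df = pd.DataFrame({
    "Name": ["silver_drn", "krumbo_drn"],
    "Measured discharge (CFS)": [25.1, 10.2],
})

# Still strings, as they come out of the numpy string array.
modeled_df = pd.DataFrame(
    data=[["silver_drn", "24.681524615195002"],
          ["krumbo_drn", "9.95"]],
    columns=["Name", "Modeled discharge (CFS)"],   # flat list, not nested
)

combined_df = pd.merge(measured_df, modeled_df, on="Name")
```

With flat column labels the merge keys line up row by row; converting the modeled strings to floats afterwards (e.g. with pd.to_numeric) is a separate step.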
I have the following dataset, read from a csv file:
x = [1,2,3,4,5]
With pandas I can access the array:
df_train = pd.read_csv("train.csv")
x = df_train["x"]
and
x = df_train[["x"]]
Since both seem to produce the same result, I wonder what the difference is between the former and the latter. Could you please explain the difference and when to use each?
In pandas, you can slice your data frame in different ways. On a high level, you can choose to select a single column out of a data frame, or many columns.
When you select many columns, you have to slice using a list, and the return is a pandas DataFrame. For example:
df[['col1', 'col2', 'col3']] # returns a data frame
When you select only one column, you can pass just the column name, and the return is a pandas Series:
df['col1'] # returns a series
When you do df[['col1']], you get back a DataFrame with only one column. In other words, it's like you're telling pandas "give me all the columns from the following list" and handing it a list with a single column in it. It will filter your df, returning all columns in your list (in this case, a DataFrame with only 1 column).
If you want more details on the difference between a Series and a one-column DataFrame, check this thread with very good answers
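The difference is easy to see directly, using the made-up "x" column from the question:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})

single = df["x"]     # bare label        -> pandas Series
framed = df[["x"]]   # list of labels    -> one-column DataFrame
```

The two print differently (the DataFrame keeps a column header), and only the DataFrame version has DataFrame-only methods and a .columns attribute.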
I have a nested list of coordinates:
I need my list to be in the format of rows and columns shown below (I think it is called a data frame), with its contents having the Pythagorean formula applied against each cell's column and row header:
What is the best approach in Python to do it?
If I understand correctly, this should solve your problem:
import numpy as np
import pandas as pd

df = pd.DataFrame(coor_house)  # coor_house: the nested list from the question
df['l2'] = np.sqrt((df[1].apply(lambda x: x[0])
                    - df[0].apply(lambda x: x[0]))**2
                   + (df[1].apply(lambda x: x[1])
                      - df[0].apply(lambda x: x[1]))**2)
This will create a dataframe where each column is a point, plus a column with the l2 norm of the difference.
I'm not very used to applying a function to a whole dataframe, so I'm sure there is a better way.
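One such better way is to vectorize with numpy instead of four .apply calls. A sketch, where coor_house is a made-up nested list of (x, y) point pairs standing in for the one from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical nested list: each row holds two (x, y) points.
coor_house = [[(0.0, 0.0), (3.0, 4.0)],
              [(1.0, 1.0), (4.0, 5.0)]]

df = pd.DataFrame(coor_house)

# Unpack each column of pairs into an (n, 2) float array, then let
# numpy compute every distance in one vectorized call.
a = np.array(df[0].tolist())
b = np.array(df[1].tolist())
df['l2'] = np.hypot(*(b - a).T)   # hypot(dx, dy) = sqrt(dx**2 + dy**2)
```

np.hypot handles the square/sum/sqrt in one step and avoids overflow issues with very large coordinates.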