I have a dataframe like below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [12124,12124,5687,5687,7892],
'A': [np.nan,np.nan,3.05,3.05,np.nan],'B':[1.05,1.05,np.nan,np.nan,np.nan],'C':[np.nan,np.nan,np.nan,np.nan,np.nan],'D':[np.nan,np.nan,np.nan,np.nan,7.09]})
I want to get box plot of columns A, B, C, and D, where the redundant row values in each column needs to be counted once only. How do I accomplish that?
Pandas requires a DataFrame to be rectangular: every column must have the same length. If the redundant values in each column are counted only once, the columns end up with different numbers of entries, which a single DataFrame cannot hold. My suggestion: transform each column of the DataFrame into a list, dropping the duplicates and NaN values along the way, and then draw the box plot from those lists.
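A minimal sketch of that list-based approach, using the example df from the question (assuming matplotlib for the plotting):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'id': [12124, 12124, 5687, 5687, 7892],
                   'A': [np.nan, np.nan, 3.05, 3.05, np.nan],
                   'B': [1.05, 1.05, np.nan, np.nan, np.nan],
                   'C': [np.nan, np.nan, np.nan, np.nan, np.nan],
                   'D': [np.nan, np.nan, np.nan, np.nan, 7.09]})

# Count duplicated rows once, then build one list per column without the NaNs
deduped = df.drop_duplicates()
data = [deduped[col].dropna().tolist() for col in ['A', 'B', 'C', 'D']]

plt.boxplot(data)                               # boxplot accepts unequal-length lists
plt.xticks([1, 2, 3, 4], ['A', 'B', 'C', 'D'])
plt.show()
```

Note that df.drop_duplicates() compares whole rows (id included), so the same value under two different ids is still kept twice.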
So I have a dataframe that looks like this for example:
In this example, I need to split the dataframe into multiple dataframes based on the account_id (or arrays, because I will convert them anyway). I want each account id (ab123982173 and bc123982173) to be an individual dataframe or array. Since the actual dataset is thousands of rows long, splitting into a temporary array in a loop was my original thought.
Any help would be appreciated.
You can get a subset of your dataframe.
Using your dataframe as example,
subset_dataframe = dataframe[dataframe["Account_ID"] == "ab123982173"]
Here is a link from the pandas documentation that has visual examples:
https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html
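If you want one sub-frame per account rather than a single subset, groupby does the split in one pass; a sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical data standing in for the real dataset
dataframe = pd.DataFrame({'Account_ID': ['ab123982173', 'ab123982173', 'bc123982173'],
                          'value': [1, 2, 3]})

# One DataFrame per account id, keyed by the id
frames = {account_id: group for account_id, group in dataframe.groupby('Account_ID')}
```

frames['ab123982173'] is then the same as the boolean-mask subset, and the loop scales to any number of account ids.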
I have implemented the below part of code :
array = [table.iloc[:, [0]], table.iloc[:, [i]]]
It is supposed to be a DataFrame consisting of two vectors extracted from a previously imported dataset. I use the parameter i because this code is part of a loop which uses a predefined function to analyze correlations between one fixed variable [0] and the rest of them; each iteration checks the correlation with a different variable [i].
Python treats this object as a list, or as a tuple when I change the brackets to round ones. I need this object to be a DataFrame (the next step is to remove NaN values using .dropna, which is a DataFrame attribute).
How can I fix that issue?
If I have correctly understood your question, you want to build an extract from a larger dataframe containing only 2 columns known by their index number. You can simply do:
sub = table.iloc[:, [0,i]]
It will keep all attributes (including index, column names and dtype) from the original table dataframe.
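The loop from the question could then look like this sketch (the table here is hypothetical stand-in data, and the correlation call stands in for the predefined analysis function):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the imported dataset
table = pd.DataFrame(np.random.default_rng(0).normal(size=(10, 4)),
                     columns=['target', 'x1', 'x2', 'x3'])

for i in range(1, table.shape[1]):
    sub = table.iloc[:, [0, i]].dropna()        # sub is a DataFrame, so .dropna() works
    corr = sub.iloc[:, 0].corr(sub.iloc[:, 1])  # correlation of column 0 with column i
```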
What is your goal with the dataframe?
A DataFrame is a common term in data analysis using pandas. Pandas was developed precisely to facilitate such analysis; getting the data from a .csv file into a DataFrame is as simple as:
import pandas as pd
df = pd.read_csv('my-data.csv')
df.info()
Or from a dict or array
df = pd.DataFrame(my_dict_or_array)
Then you can select the columns you wish
df.loc[:, ['COLUMN_1', 'COLUMN_2']]
Let us know if it's what you are looking for
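Putting those pieces together, a minimal sketch with a hypothetical dict standing in for the csv contents:

```python
import pandas as pd

# Hypothetical data standing in for the csv contents
my_dict = {'CITY': ['a', 'b'], 'SALES': [10, 20], 'COST': [4, 9]}
df = pd.DataFrame(my_dict)

subset = df.loc[:, ['CITY', 'SALES']]   # select columns by label
```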
I'm trying to remove duplicate elements in a dataframe. This DataFrame comes from calculating the distance between a given list of geocoordinates. As you can see in the following DataFrame, the data is duplicated but I can't set the index to 'dist' because in other cases, the distance might be 0 or 1 (repeated) and then important data will be discarded.
import pandas as pd
df = pd.DataFrame({'Name_x':['a','b','c','d'],
'Name_y':['b','a','d','c'],
'Latitude_x':['lat_a','lat_b','lat_c','lat_d'],
'Longitude_x':['long_a','long_b','long_c','long_d'],
'Latitude_y':['lat_b','lat_a','lat_d','lat_c'],
'Longitude_y':['long_b','long_a','long_d','long_c'],
'dist':[0,0,1,1]})
df
In this case I would like to remain with the values Name_x: ['a','c'], Name_y['b','d'] with the corresponding geocoordinates: Latitude_x:['lat_a','lat_c'], Latitude_y:['lat_b','lat_d'], Longitude_x:['long_a','long_c'], Longitude_y: ['long_b','long_d'].
I'm not sure if you want this:
df['Name_x'].eq(df['Name_y'].shift())  # True where a row repeats the pair directly above it
df.loc[~df['Name_x'].eq(df['Name_y'].shift())]  # your "unique" rows: 'a'/'b' and 'c'/'d'
Note that this relies on the two rows of each duplicated pair being adjacent.
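An order-independent alternative, as a sketch: sort each name pair alphabetically so 'a'/'b' and 'b'/'a' compare equal, then keep the first occurrence of every unordered pair.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name_x': ['a', 'b', 'c', 'd'],
                   'Name_y': ['b', 'a', 'd', 'c'],
                   'Latitude_x': ['lat_a', 'lat_b', 'lat_c', 'lat_d'],
                   'Longitude_x': ['long_a', 'long_b', 'long_c', 'long_d'],
                   'Latitude_y': ['lat_b', 'lat_a', 'lat_d', 'lat_c'],
                   'Longitude_y': ['long_b', 'long_a', 'long_d', 'long_c'],
                   'dist': [0, 0, 1, 1]})

# Sort each (Name_x, Name_y) pair so reversed pairs become identical
pair = pd.DataFrame(np.sort(df[['Name_x', 'Name_y']].to_numpy(), axis=1), index=df.index)
result = df[~pair.duplicated()]   # keeps the 'a'/'b' and 'c'/'d' rows
```

Unlike the shift approach, this works even when the duplicated pairs are not adjacent in the frame.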
I have a pandas Series that contains key-value pairs, where the key is the name of a column in my pandas DataFrame and the value is an index in that column of the DataFrame.
For example:
Series:
(shown as an image in the original post)
Then in my DataFrame:
(shown as an image in the original post)
Therefore, I want to extract from my DataFrame the value at index 12 for 'A', which is 435.81. I want to put all these values into another Series, so something like {'A': 435.81, 'AAP': 468.97, ...}.
I think this indexing is what you're looking for.
pd.Series(np.diag(df.loc[ser,ser.axes[0]]), index=df.columns)
df.loc allows you to index based on string labels. You get your rows from the values in ser (the first positional argument of df.loc) and your column locations from the labels of ser (ser.index, which is the same as ser.axes[0]). The values you want lie along the main diagonal of the result, so you take just the diagonal and associate it with the column labels.
The indexing I gave before only works if your DataFrame uses integer row indices, or if the data type of your Series values matches the DataFrame row indices. If you have a DataFrame with non-integer row indices, but still want to get values based on integer rows, then use the following (however, all indices from your series must be within the range of the DataFrame, which is not the case with 'AAL' being 1758 and only 12 rows, for example):
pd.Series(np.diag(df.iloc[ser,:].loc[:,ser.axes[0]]), index=df.columns)
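A small self-contained check of the first expression, with hypothetical stand-in data for the posted images (ser maps each column name to the integer row index holding the wanted value):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the Series and DataFrame shown as images
df = pd.DataFrame({'A': [10.0, 435.81], 'AAP': [468.97, 20.0]})
ser = pd.Series({'A': 1, 'AAP': 0})   # column name -> row index of the wanted value

# Row labels come from ser's values, column labels from ser's index;
# the wanted entries sit on the main diagonal of the resulting 2x2 frame
result = pd.Series(np.diag(df.loc[ser, ser.index]), index=ser.index)
```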
I have a csv file with 8 columns in it. I want to plot a graph between 2 columns using matplotlib. One of the columns has repetitive values. I want to take the mean of the values from the other column which has same corresponding value in the first column.
How can I do it?
This isn't really specific to matplotlib. Pandas has nice support for this kind of data mangling. Read your csv file into a Pandas dataframe:
import pandas as pd
df = pd.read_csv('data.csv')
Then, assuming the column you want to group by is named 'key' and the column whose values you want the means of is named 'value', you can do:
grouped = df.groupby('key')['value'].mean()
grouped.plot()
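A self-contained sketch with hypothetical data, in case the csv layout matters:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data standing in for data.csv: 'key' repeats, 'value' varies
df = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'b'],
                   'value': [1.0, 3.0, 2.0, 4.0, 6.0]})

grouped = df.groupby('key')['value'].mean()   # one mean per repeated key
grouped.plot(kind='bar')                      # or the default line plot
plt.show()
```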