I am trying to use a data frame to regroup different kinds of data.
I have a data frame with 3 columns:
one that I define as the index (used with a groupby command)
one that holds a parameter, say 'valeur1', for which I want the mean over the rows that share the same index (using a mean command after the groupby)
the last column contains strings. There is only 1 string for each index, but some cells might contain NaN.
In the end I am trying to get a dataframe with the mean of the parameter for each index, as well as the string that goes with that index (NaN values in the string column are not important). Here is a picture with an example of what I am trying to get: illustration. The main issue is that dataframe.mean does not work with strings.
The code I have used so far is pretty basic:
import pandas as pd
dataRaw = pd.read_csv('file.csv', sep=';', encoding='latin-1')
data = dataRaw.groupby('index')  # group the rows that share the same index value
databis = data.mean()
Any suggestion would be greatly appreciated.
Thanks !
I think you need to group by multiple columns:
databis = dataRaw.groupby(['index', 'String']).mean()
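One caveat with grouping on both columns: by default groupby drops rows whose key is NaN, so rows with a missing String would be left out of the mean. A possible alternative, sketched below on made-up sample data, is to group on the index column only and aggregate each column with its own function: "mean" for valeur1 and "first" (which returns the first non-null value in each group) for String. The column names follow the question and the answer above; the values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the contents of file.csv
dataRaw = pd.DataFrame({
    "index": ["a", "a", "b", "b", "b"],
    "valeur1": [1.0, 3.0, 10.0, 20.0, 30.0],
    "String": ["foo", np.nan, np.nan, "bar", np.nan],
})

# Group on the index column only, then aggregate each column separately:
# the numeric column gets its mean, the string column keeps its first
# non-null entry per group.
databis = dataRaw.groupby("index").agg(
    valeur1=("valeur1", "mean"),
    String=("String", "first"),
)
print(databis)
```

This keeps every row in the mean, including those where the string is missing, and still yields one string per index.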
I have a dataframe that has "?" in the data. This is a pic of said df. In another part of the assignment I had to calculate the values that should go where these "?" are (the values that need to be added). I need a way to add these values into a SINGLE column in a pandas dataframe. The values should be added in the order in which they appear in the first picture (even though the indices obviously do not match), filling the "?" entries from top to bottom in the larger column of the main dataframe.
I am new to using Python with data sets and am trying to exclude a column ("id") from being shown in the output. I am wondering how to go about this using describe() and its exclude parameter.
describe filters on data types: you can include or exclude columns by dtype, not by column name. If id is the only column with its data type, then
df.describe(exclude=[datatype])
or, if you just want to leave the column(s) out of describe, try this
cols = [c for c in df.columns if c != 'id']  # a list keeps the original column order, unlike a set
df1 = df[cols]
df1.describe()
That's it. For more information, see the pandas documentation for describe.
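As a minimal sketch of both approaches, the example below uses a small made-up DataFrame (the height and weight columns are assumptions for illustration only). It drops the id column by name with DataFrame.drop before calling describe, and separately excludes it by dtype, which only works here because id happens to be the only integer column.

```python
import pandas as pd

# Hypothetical data: "id" is the only integer column
df = pd.DataFrame({
    "id": [101, 102, 103],
    "height": [1.5, 1.7, 1.6],
    "weight": [50.0, 70.0, 60.0],
})

# Option 1: drop the column by name, then describe what is left
summary = df.drop(columns=["id"]).describe()
print(summary)

# Option 2: exclude by dtype (this would also remove any other int64 column)
summary_by_dtype = df.describe(exclude=["int64"])
print(summary_by_dtype)
```

Dropping by name is usually the safer choice, since it does not depend on which dtypes the other columns have.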
You can do that by slicing your original DF to leave out the 'id' column. One way is through .iloc. Suppose the column 'id' is the first column of your DF; then you could do this:
df.iloc[:,1:].describe()
The first colon selects all rows; the 1: after the comma selects every column from the second one onward.
Although somebody already responded with an example from the official docs, which is more than enough, I'd just like to add this, since it might help a few people:
If your DataFrame is large (say, hundreds of columns), removing one or two columns might not be enough. Instead, create a smaller DataFrame holding what you're interested in and go from there.
Example of removing 2+ columns:
columns_to_keep = set(your_bigger_data_frame.columns) - {'column_1', 'column_2', 'column_3', 'etc'}
your_new_smaller_data_frame = your_bigger_data_frame[list(columns_to_keep)]
your_new_smaller_data_frame.describe()
If your DataFrame is small or medium-sized, you already know every column, and you only need a few of them, just create a new DataFrame and then apply describe():
Here is an example that reads a .csv file and then keeps a smaller portion of that DataFrame, holding only what you need:
df = pd.read_csv(r'.\docs\project\file.csv')  # raw string so the backslashes are not treated as escapes
df = df[['column_1','column_2','column_3','etc']]
df.describe()
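Use output.drop(columns=['id']).describe(). Note that the exclude parameter of describe expects data types, not column names, so output.describe(exclude=['id']) does not work.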
I'm new to Python and coding in general. I am attempting to automate the processing of some groundwater model output data in Python. One pandas dataframe has measured stream flow with multiple columns of various types (left); the other has modeled stream flow (right). I've attempted to use pd.merge on the column "Name" in order to link the correct modeled output value to the corresponding measured site value. When I use the following script, I get the error shown below it:
left = measured_df
right = modeled_df
combined_df = pd.merge(left, right, on= 'Name')
ValueError: The column label 'Name' is not unique.
For a multi-index, the label must be a tuple with elements corresponding to each level.
The modeled data for each stream starts out as a numpy array (not sure about the dtype)
array(['silver_drn', '24.681524615195002'], dtype='<U18')
I then use np.concatenate to combine the 6 stream outputs into one array:
modeled = np.concatenate([[blitz_drn],[silvies_ss_drn],[silvies_drn],[bridge_drn],[krumbo_drn], [silver_drn]])
Then pd.DataFrame to create a pandas data frame with a column header:
modeled_df = pd.DataFrame(data=modeled, columns= [['Name','Modeled discharge (CFS)']])
See the image links for how each dataframe looks, left and right (not sure of the best way to share just yet).
Perhaps I'm misunderstanding how pd.merge works, or maybe the datatypes are different even if they appear to be text, but I figured that if each column was a string, it would append the modeled output to the corresponding row where the "Name" matches within each dataframe. Any help would be greatly appreciated.
When you do this:
modeled_df = pd.DataFrame(data=modeled,
                          columns=[['Name', 'Modeled discharge (CFS)']])
you create a MultiIndex on the columns, because the column names are wrapped in a nested list. Merging a DataFrame whose columns are a MultiIndex with one that has ordinary column labels does not work as you might expect.
You should instead do:
modeled_df = pd.DataFrame(data=modeled,
                          columns=['Name', 'Modeled discharge (CFS)'])
# note the single pair of brackets around the column names
Then the merge should work as expected.
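To make the difference concrete, here is a small sketch that builds the modeled data both ways and checks the type of the resulting columns. Only silver_drn and its value come from the question; the bridge_drn value, the measured_df contents, and the "Measured discharge (CFS)" column name are hypothetical stand-ins. Since the modeled values arrive as strings, the sketch also converts them to floats before merging, which is an extra step not mentioned above.

```python
import numpy as np
import pandas as pd

silver_drn = np.array(["silver_drn", "24.681524615195002"])
bridge_drn = np.array(["bridge_drn", "3.5"])  # hypothetical value
modeled = np.concatenate([[silver_drn], [bridge_drn]])

# Nested list of names -> MultiIndex columns
nested = pd.DataFrame(data=modeled, columns=[["Name", "Modeled discharge (CFS)"]])
# Flat list of names -> ordinary column labels
modeled_df = pd.DataFrame(data=modeled, columns=["Name", "Modeled discharge (CFS)"])

print(isinstance(nested.columns, pd.MultiIndex))
print(isinstance(modeled_df.columns, pd.MultiIndex))

# The values are still text at this point; convert them to numbers
modeled_df["Modeled discharge (CFS)"] = modeled_df["Modeled discharge (CFS)"].astype(float)

# Hypothetical measured data sharing the "Name" key
measured_df = pd.DataFrame({
    "Name": ["silver_drn", "bridge_drn"],
    "Measured discharge (CFS)": [25.0, 4.0],
})

combined_df = pd.merge(measured_df, modeled_df, on="Name")
print(combined_df)
```

With flat column labels on both sides, each measured row picks up the modeled value whose Name matches.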
I have a DataFrame with four columns and want to generate a new DataFrame with only one column containing the maximum value of each row.
Using df2 = df1.max(axis=1) gave me the correct results, but the column is titled 0 and is not operable, meaning I cannot check its data type or change its name, which is critical for further processing. Does anyone know what is going on here? Or, better yet, is there a better way to generate this new DataFrame?
The result is a Series, not a DataFrame. To get a one-column DataFrame, use Series.to_frame:
df2 = df1.max(axis=1).to_frame('maximum')
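A short sketch on made-up data (the column names a to d and their values are assumptions) shows the intermediate Series and the resulting one-column DataFrame. If a named Series is enough for the later processing, Series.rename is an alternative to to_frame.

```python
import pandas as pd

# Hypothetical four-column DataFrame
df1 = pd.DataFrame({
    "a": [1, 5, 3],
    "b": [4, 2, 9],
    "c": [7, 0, 1],
    "d": [2, 8, 6],
})

row_max = df1.max(axis=1)          # one value per row, returned as a Series
print(type(row_max).__name__, row_max.dtype)

df2 = row_max.to_frame("maximum")  # one-column DataFrame with a usable name
print(df2)

named = row_max.rename("maximum")  # alternative: keep the Series, just name it
```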
I have a data frame df with a column called "Num_of_employees", which has values like 50-100, 200-500, etc. I see a problem with a few values in my data. Wherever the employee number should be 1-10, the data has it as 10-Jan. Also, wherever the value should be 11-50, the data has it as Nov-50. How would I rectify this problem using pandas?
A clean syntax for this kind of "find and replace" uses a dict:
df.Num_of_employees = df.Num_of_employees.replace({"10-Jan": "1-10",
                                                   "Nov-50": "11-50"})