How can I turn this table:
into a pandas DataFrame? I can't work out how to build the 'Machines' column.
You can't really do that in a dataframe, because a single axis can't combine a one-level index with a multi-level index.
One way to get as close as possible to what you want is to concatenate individual pandas Series for the one-level columns with a two-level dataframe for the 'Machines' columns, as follows:
import pandas as pd

pd.concat({
    'Company name': pd.Series(['a', 'b', 'c']),
    'Number of machines': pd.Series([1, 4, 2]),
    'Machines': pd.DataFrame({
        '2015-2020': pd.Series([3, 1, 0]),
        '2018-2014': pd.Series([1, 8, 3]),
        'Other': pd.Series([5, 0, 4]),
    })
}, axis=1)
You will still get a two-level column index as a result, and the first columns will have an integer label (0, 1, etc.) on the second level.
Thank you. My boss asked me to process some data and show him the result in an Excel file laid out like the one I posted here. (It's just an example, but the columns have to look exactly like that.)
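Since the goal is an Excel-style header, another option worth trying is to build the two-level columns directly and leave the second level blank for the plain columns. This is only a sketch under that assumption: the values are the ones from the answer above, and the names table and machines are made up for illustration.

```python
import pandas as pd

# Two-level column labels; an empty string stands in for the
# missing second level of the single-level columns.
columns = pd.MultiIndex.from_tuples([
    ("Company name", ""),
    ("Number of machines", ""),
    ("Machines", "2015-2020"),
    ("Machines", "2018-2014"),
    ("Machines", "Other"),
])

# One row per company, in the same order as the column labels
data = [
    ["a", 1, 3, 1, 5],
    ["b", 4, 1, 8, 0],
    ["c", 2, 0, 3, 4],
]

table = pd.DataFrame(data, columns=columns)
print(table)

# Selecting the top-level label returns only the sub-columns
machines = table["Machines"]
print(machines)
```

The columns are still a two-level index, but the blank second level prints as empty space rather than as 0, 1, etc.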
I have two dataframes of different shapes.
The 'ANTENNA1' and 'ANTENNA2' columns in the bigger dataframe correspond to the ID column in the smaller dataframe. I want to merge the smaller dataframe into the bigger one so that the bigger dataframe gains '(POSITION, col1)', '(POSITION, col2)', '(POSITION, col3)' for the rows where ANTENNA1 == ID.
Edit: I tried pd.merge, but it changes the column values of the original dataframe.
Original:
df = pd.merge(df_main, df_sub, left_on='ANTENNA1', right_on='id', how='left')
Result:
I want to keep the original dataframe's columns as they are.
Assuming your first dataframe (with the positions) is called df1 and the second is called df2, you could just use pd.merge (or the equivalent pandas.DataFrame.merge method) on your loaded data:
df = pd.merge(df1, df2, left_on='id', right_on='ANTENNA1')
Then you can select the columns you need (col1, col2, ..) to get the desired result: df[["col1","col2",..]].
A simple example:
import pandas as pd

# creating dataframes as df1 and df2
df1 = pd.DataFrame({'ID': [1, 2, 3, 5, 7, 8],
                    'Name': ['Sam', 'John', 'Bridge',
                             'Edge', 'Joe', 'Hope']})
df2 = pd.DataFrame({'id': [1, 2, 4, 5, 6, 8, 9],
                    'Marks': [67, 92, 75, 83, 69, 56, 81]})

# merging df1 and df2 by ID,
# i.e. only the rows with common IDs
# ({1, 2, 5, 8}) end up in the result
df = pd.merge(df1, df2, left_on="ID", right_on="id")
print(df)
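For the antenna question itself, a left merge keeps every row and column of the bigger dataframe and only appends the looked-up values. The sketch below uses invented toy data. The names df_main, df_sub, 'ANTENNA1' and 'id' come from the question, while the 'DATA' column and the 'POSITION_' prefix are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the two dataframes in the question
df_main = pd.DataFrame({
    "ANTENNA1": [0, 1, 1, 2],
    "ANTENNA2": [1, 2, 0, 0],
    "DATA": [10.0, 11.0, 12.0, 13.0],
})
df_sub = pd.DataFrame({
    "id": [0, 1, 2],
    "col1": [1.0, 2.0, 3.0],
    "col2": [4.0, 5.0, 6.0],
})

# Prefix the looked-up columns so they cannot clash with existing names
positions = df_sub.rename(
    columns={"col1": "POSITION_col1", "col2": "POSITION_col2"}
)

# how="left" keeps every row of df_main in its original order;
# the helper key column 'id' is dropped afterwards.
merged = (
    df_main
    .merge(positions, left_on="ANTENNA1", right_on="id", how="left")
    .drop(columns="id")
)
print(merged)
```

The original ANTENNA1, ANTENNA2 and DATA columns come through unchanged, and df_main itself is not modified because merge returns a new dataframe.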
This is my task:
Write a function that accepts a dataframe, the name of the column with missing values, and a list of grouping columns, and returns the dataframe with the missing values filled in with the median value.
Here is what I tried:
def fillnull(set, col):
    val = {col: set[col].sum() / set[col].count()}
    set.fillna(val)
    return set

fillnull(titset, 'Age')
My problem is that my function doesn't work. I also don't know how to compute the median or how to group inside this function.
Here are photos of my dataframe and missing values of my dataset
DATAFRAME
NaN Values
Check whether this code works for you:
import pandas as pd

df = pd.DataFrame({
    'processId': range(100, 900, 100),
    'groupId': [1, 1, 2, 2, 3, 3, 4, 4],
    'other': [1, 2, 3, None, 3, 4, None, 9]
})
print(df)

def fill_na(df, missing_value_col, grouping_cols):
    # Median of each group, broadcast back onto the original rows
    medians = df.groupby(grouping_cols)[missing_value_col].transform('median')
    # Work on a copy so the caller's dataframe is left untouched
    result = df.copy()
    result[missing_value_col] = result[missing_value_col].fillna(medians)
    return result

df = fill_na(df, 'other', ['groupId'])
print(df)
Assuming I have a data frame similar to the one below (the actual data frame has millions of observations), how would I get the correlation between the signal column and a list of return columns, grouped by the Signal_Up column?
I tried the pandas corrwith function, but it does not give me the correlation grouped by the Signal_Up column:
df[['Net_return_at_t_plus1', 'Net_return_at_t_plus5',
    'Net_return_at_t_plus10']].corrwith(df['Signal_Up'])
I am looking for the correlation between the signal column and the net return columns, grouped by the values of the Signal_Up column.
The data and the desired result are given below.
Desired Result
Data
Using the simple dataframe below:
df = pd.DataFrame({'v1': [1, 3, 2, 1, 6, 7],
                   'v2': [2, 2, 4, 2, 4, 4],
                   'v3': [3, 3, 2, 9, 2, 5],
                   'v4': [4, 5, 1, 4, 2, 5]})
(1st interpretation) one way to get correlations of one variable with the other columns is:
correlations = df.corr().unstack().sort_values(ascending=False) # Build correlation matrix
correlations = pd.DataFrame(correlations).reset_index() # Convert to dataframe
correlations.columns = ['col1', 'col2', 'correlation'] # Label it
correlations.query("col1 == 'v2' & col2 != 'v2'") # Filter by variable
# output of this code will give correlation of column v2 with all the other columns
(2nd interpretation) one way to get correlations of column v1 with column v3, v4 after grouping by column v2 is using this one line:
df.groupby('v2')[['v1', 'v3', 'v4']].corr().unstack()['v1']
In your case, v2 is 'Signal_Up', v1 is 'signal' and v3, v4 columns proxy 'Net_return_at_t_plusX' columns.
I am able to get the correlations by individual category of Signal_Up column by using “groupby” function. However, I am not able to apply “corr” function to more than two columns.
So, I had to use “concat” function to combine all of them.
a = df.groupby('Signal_Up')[['signal','Net_return_at_t_plus1']].corr().unstack().iloc[:,1]
b = df.groupby('Signal_Up')[['signal','Net_return_at_t_plus5']].corr().unstack().iloc[:,1]
c = df.groupby('Signal_Up')[['signal','Net_return_at_t_plus10']].corr().unstack().iloc[:,1]
dfCorr = pd.concat([a, b, c], axis=1)
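The three separate calls above can probably be collapsed into a single groupby by applying corrwith inside each group. The following is a sketch on invented toy data: the column names follow the question, but the values and the helper corr_with_signal are made up for illustration.

```python
import pandas as pd

# Toy data: within group 0 the first return column rises with the
# signal, within group 1 it falls.
df = pd.DataFrame({
    "Signal_Up": [0, 0, 0, 1, 1, 1],
    "signal": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "Net_return_at_t_plus1": [2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
    "Net_return_at_t_plus5": [1.0, 3.0, 2.0, 1.0, 2.0, 4.0],
})
return_cols = ["Net_return_at_t_plus1", "Net_return_at_t_plus5"]

def corr_with_signal(group):
    # Correlate every return column with 'signal' inside one group
    return group[return_cols].corrwith(group["signal"])

# One row per Signal_Up value, one column per return column
corr_by_group = (
    df.groupby("Signal_Up")[["signal"] + return_cols]
      .apply(corr_with_signal)
)
print(corr_by_group)
```

Adding another return column then only requires extending return_cols, with no extra concat step.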
When there is a DataFrame like the following:
import pandas as pd
df = pd.DataFrame(1, index=[100, 29, 234, 1, 150], columns=['A'])
How can I sort this dataframe by index with each combination of index and column value intact?
Dataframes have a sort_index method which returns a copy by default. Pass inplace=True to operate in place.
import pandas as pd
df = pd.DataFrame([1, 2, 3, 4, 5], index=[100, 29, 234, 1, 150], columns=['A'])
df.sort_index(inplace=True)
print(df.to_string())
Gives me:
     A
1    4
29   2
100  1
150  5
234  3
Slightly more compact:
df = pd.DataFrame([1, 2, 3, 4, 5], index=[100, 29, 234, 1, 150], columns=['A'])
df = df.sort_index()
print(df)
Note:
DataFrame.sort has been deprecated; sort_index replaces it for this scenario.
It is preferable not to use inplace, as it is usually harder to read and prevents chaining. See the explanation in this answer:
Pandas: peculiar performance drop for inplace rename after dropna
If the DataFrame index has a name, you can also use sort_values() to sort by that name. For example, if the index is named lvl_0, you can sort by this name. This case is common when the dataframe comes from a groupby or pivot_table operation.
df = df.sort_values('lvl_0')
If the index has name(s), you can even sort by both index and a column value. For example, the following sorts by both the index and the column A values:
df = df.sort_values(['lvl_0', 'A'])
If you have a MultiIndex dataframe, you can sort by index level using the level= parameter. For example, to sort by the second level in descending order and then by the first level in ascending order, you can use the following code.
df = df.sort_index(level=[1, 0], ascending=[False, True])
If the index levels have names, you can again call sort_values(). For example, the following sorts by the index levels 'lvl_1' and 'lvl_2'.
df = df.sort_values(['lvl_1', 'lvl_2'])
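To make the last few snippets concrete, here is a small runnable sketch. The data is invented and the variable names (named, multi, by_name_and_column, by_levels) are assumptions for illustration; only the level names lvl_0, lvl_1 and lvl_2 come from the text above.

```python
import pandas as pd

# Single named index: sort by the index name and then by column A
named = pd.DataFrame(
    {"A": [3, 1, 2, 1]},
    index=pd.Index([2, 1, 2, 1], name="lvl_0"),
)
by_name_and_column = named.sort_values(["lvl_0", "A"])
print(by_name_and_column)

# Two-level index: second level descending, first level ascending
levels = pd.MultiIndex.from_tuples(
    [("a", 1), ("b", 2), ("a", 2), ("b", 1)],
    names=["lvl_1", "lvl_2"],
)
multi = pd.DataFrame({"A": [10, 20, 30, 40]}, index=levels)
by_levels = multi.sort_index(level=[1, 0], ascending=[False, True])
print(by_levels)
```

In both cases each row keeps its own index label and column value together; only the row order changes.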