I have a dataframe DF in the following format (the first row holds the column names):
name val1 val2
A 1 2
B 3 4
How can I convert it to the following data frame format?
name map
A {val1:1,val2:2}
B {val1:3,val2:4}
You can achieve this with the DataFrame's to_dict() method:
DF['map'] = DF[['val1', 'val2']].to_dict(orient='records')
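A minimal runnable sketch of the answer above, assuming the sample data from the question. The name `result` is only an illustration and does not appear in the original post.

```python
import pandas as pd

# Sample data mirroring the question's DF
DF = pd.DataFrame({"name": ["A", "B"], "val1": [1, 3], "val2": [2, 4]})

# orient="records" yields one dict per row, in row order
DF["map"] = DF[["val1", "val2"]].to_dict(orient="records")

# Keep only the two columns shown in the desired output
# ("result" is a hypothetical name used for this sketch)
result = DF[["name", "map"]]
print(result)
```

Each cell of the map column then holds a plain Python dict, so the column has object dtype.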
Suppose I have the following dataframe called "df":
ID;Type
1;A
1;A
1;A
1;A
1;A
2;A
2;A
3;B
4;A
4;B
Now I want to group this dataframe by ID and Type, count the rows in each group, and export the result as a csv.
The result should look like:
ID;Type;Count
1;A;5
2;A;2
3;B;1
4;A;1
4;B;1
My code for summing up / group by is as follows:
import pandas as pd
frameexport=df.groupby(["ID", "Type"])["Type"].count()
print(frameexport)
ID Type
1 A 5
2 A 2
3 B 1
4 A 1
B 1
This is not exactly what I hoped for. The last entry for ID 4 does not repeat the "4".
I then tried to export it to a csv:
frameexport.to_csv(r'C:\Path\file.csv', index=False, sep=";")
But it doesn't work, it just looks like this:
Type
5
2
1
1
1
The problem seems to be that I call to_csv on the grouped object, and apparently it doesn't work like this.
How can I get the desired result:
ID Type Count
1 A 5
2 A 2
3 B 1
4 A 1
4 B 1
exported to a csv?
The result is a Series with a MultiIndex, so you need to remove index=False (otherwise the ID and Type levels are dropped) and use rename to get the correct column name:
frameexport.rename('Count').to_csv(r'C:\Path\file.csv', sep=";")
Or create a DataFrame with Series.reset_index. The new default index then has to be omitted, so index=False is correct here:
frameexport.reset_index(name='Count').to_csv(r'C:\Path\file.csv', index=False, sep=";")
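For reference, here is a self-contained sketch of the reset_index variant, assuming the sample data from the question. It uses size() instead of count() to get the row count per group, and writes to an in-memory buffer instead of the Windows path so it can run anywhere. The names `counts`, `buffer` and `csv_text` are illustrative only.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 2, 2, 3, 4, 4],
    "Type": ["A", "A", "A", "A", "A", "A", "A", "B", "A", "B"],
})

# size() counts rows per (ID, Type) group; reset_index turns the
# MultiIndex levels back into ordinary columns and names the counts
counts = df.groupby(["ID", "Type"]).size().reset_index(name="Count")

# StringIO stands in for a file path (an assumption made for this sketch)
buffer = io.StringIO()
counts.to_csv(buffer, index=False, sep=";")
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a real path to to_csv in place of the buffer should produce the same content on disk.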
Premise
I need to use a dictionary as a filter on a large dataframe, where the key-value pairs are values in different columns.
This dictionary is obtained from a separate dataframe, using dict(zip(df.id_col, df.rank_col)) so if a dictionary isn't the best way to go, that is open to change.
This is very similar to this question: Filter a pandas dataframe using values from a dict but fundamentally (I think) different because my dictionary contains column-paired values:
Example data
df_x = pd.DataFrame({'id':[1,1,1,2,2,2,3,3,3],
'B':[1,1,1,0,1,0,1,0,1], 'Rank':['1','2','3','1', '2','3','1','2','3'],'D':[1,2,3,4,5,6,7,8,9]})
filter_dict = {'1':'1', '2':'3', '3':'2'}
For this dataframe df_x I would want to be able to look at the filter dictionary and apply it to a set of columns, here id and Rank, so the dataframe is pared down to:
The actual source dataframe is approx 1M rows, and the dictionary is >100 key-value pairs.
Thanks for any help.
You can build an (id, Rank) tuple per row and check it against the dictionary's items with isin:
df_x[df_x[['id','Rank']].astype(str).apply(tuple,1).isin(filter_dict.items())]
Out[182]:
id B Rank D
0 1 1 1 1
5 2 0 3 6
7 3 0 2 8
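Since the real dataframe has about 1M rows, a row-wise apply may be slow. One possible alternative, sketched here under the assumption that each id maps to exactly one wanted Rank (as in filter_dict), is to look up the wanted Rank with Series.map and compare. This is a different technique from the isin answer above, and the names `wanted_rank` and `filtered` are illustrative only.

```python
import pandas as pd

df_x = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "B": [1, 1, 1, 0, 1, 0, 1, 0, 1],
    "Rank": ["1", "2", "3", "1", "2", "3", "1", "2", "3"],
    "D": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})
filter_dict = {"1": "1", "2": "3", "3": "2"}

# Look up the wanted Rank for each row's id (keys are strings, so cast first);
# ids that are missing from the dict become NaN
wanted_rank = df_x["id"].astype(str).map(filter_dict)

# Keep rows whose Rank equals the wanted one; NaN never compares equal,
# so rows with unknown ids are dropped
filtered = df_x[wanted_rank == df_x["Rank"]]
print(filtered)
```

On the example data this keeps the same rows (index 0, 5 and 7) as the isin approach.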
I was able to produce a pandas dataframe with identical column names.
Is this normal for a pandas dataframe?
How can I select only one of the two columns?
When I use the identical name, does it necessarily return both columns of the dataframe?
Example given below:
# Producing a new empty pd dataset
dataset=pd.DataFrame()
# fill in a list with values to be added to the dataset later
cases=[1]*10
# Adding the list of values in the dataset, and naming the variable / column
dataset["id"]=cases
# making a list of columns as it is displayed below:
data_columns = ["id", "id"]
# Then, we call the pd dataframe using the defined column names:
dataset_new=dataset[data_columns]
# dataset_new
# It has as a result two columns with identical names.
# How can I process only one of the two dataset columns?
id id
0 1 1
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
You can use the .iloc to access either column.
dataset_new.iloc[:,0]
or
dataset_new.iloc[:,1]
and of course you can rename your columns, just as you did when you set them both to 'id', using:
dataset_new.columns = ['id_1', 'id_2']
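import pandas as pd
df = pd.DataFrame()
lst = ['1', '2', '3']
df[0] = lst
df[1] = lst
# rename both columns to the same label
df.rename(columns={0:'id'}, inplace=True)
df.rename(columns={1:'id'}, inplace=True)
# after the rename the label 1 no longer exists, so select by position
print(df.iloc[:, [1]])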
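The options above can be put together in a short sketch, assuming the dataset from the question. Dropping repeated labels with columns.duplicated() is an extra option not mentioned in the answers, and the names `first`, `deduped` and `renamed` are illustrative only.

```python
import pandas as pd

dataset = pd.DataFrame({"id": [1] * 10})
dataset_new = dataset[["id", "id"]]  # two columns, both labelled "id"

# Selecting by label returns both columns; selecting by position returns one
both = dataset_new["id"]
first = dataset_new.iloc[:, 0]

# Keep only the first occurrence of each column label
deduped = dataset_new.loc[:, ~dataset_new.columns.duplicated()]

# Or give the columns distinct labels (on a copy, leaving dataset_new as is)
renamed = dataset_new.copy()
renamed.columns = ["id_1", "id_2"]
```

Duplicate labels are allowed by pandas, but label-based selection then returns every matching column, which is why the positional or deduplicating forms are usually easier to work with.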
I create a DataFrame with some duplicated cell values in the column named 'keys':
import pandas as pd
df = pd.DataFrame({'keys': [1,2,2,3,3,3,3],'values':[1,2,3,4,5,6,7]})
I go ahead and create two more DataFrames which are the consolidated versions of the original DataFrame df. Those newly created DataFrames will have no duplicated cell values under the 'keys' column:
df_sum = df.groupby('keys').sum().reset_index()
df_mean = df.groupby('keys').mean().reset_index()
As you can see, the df_sum['values'] cell values were summed within each key,
while the df_mean['values'] cell values were averaged with the mean() method.
Lastly I rename the 'values' column in both dataframes with:
df_sum.columns = ['keys', 'sums']
df_mean.columns = ['keys', 'means']
Now I would like to copy the df_mean['means'] column into the dataframe df_sum.
How to achieve this?
The Photoshopped image below illustrates the dataframe I would like to create, with both the 'sums' and 'means' columns merged into a single DataFrame:
There are several ways to do this. Using the DataFrame's merge method is a straightforward one:
df_both = df_sum.merge(df_mean, how='left', on='keys')
df_both
Out[1]:
keys sums means
0 1 1 1.0
1 2 5 2.5
2 3 22 5.5
I think pandas.merge() is the function you are looking for, e.g. pd.merge(df_sum, df_mean, on="keys"). Alternatively, the same result can be computed in a single agg call as follows:
df.groupby('keys')['values'].agg(['sum', 'mean']).reset_index()
# keys sum mean
#0 1 1 1.0
#1 2 5 2.5
#2 3 22 5.5
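If the 'sums' and 'means' column names from the question are wanted directly, named aggregation (available in pandas 0.25 and later) may be a convenient variant. This is a sketch using the question's sample data, and `df_both` simply reuses the name from the merge answer.

```python
import pandas as pd

df = pd.DataFrame({"keys": [1, 2, 2, 3, 3, 3, 3],
                   "values": [1, 2, 3, 4, 5, 6, 7]})

# Each keyword is an output column name mapped to (input column, function)
df_both = df.groupby("keys", as_index=False).agg(
    sums=("values", "sum"),
    means=("values", "mean"),
)
print(df_both)
```

This skips the intermediate df_sum and df_mean frames and the renaming step.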
Given the following list:
list=['a','b','c']
I'd like to create a data frame where the list is the column of values.
I'd like the header to be "header".
Like this:
header
a
b
c
Thanks in advance!
Wouldn't that be:
import pandas as pd
values = ['a','b','c']  # renamed from "list" to avoid shadowing the built-in
df = pd.DataFrame({'header': values})
header
0 a
1 b
2 c
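The desired output in the question shows no row index. The index is always part of the DataFrame, but it can be hidden when printing. A small sketch, where the variable names `values` and `text` are illustrative only:

```python
import pandas as pd

values = ["a", "b", "c"]
df = pd.DataFrame({"header": values})

# to_string(index=False) renders the frame without the 0, 1, 2 index column
text = df.to_string(index=False)
print(text)
```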