Convert a pandas df column to dictionary values with a dictionary key - python

import pandas as pd
df = pd.DataFrame({'Name': ['John', 'John', 'Sara', 'Sara', 'Sara', 'Peter', 'Peter'],
'Age': [11, 22, 33, 44, 55, 66, 77]})
Assume that I have a data frame which is given above. My goal is to convert this data frame to the following dictionary format below. Does anybody know a convenient way to solve this problem? Thanks in advance.
# Expected Output:
out_df = {'John':[11, 22], 'Sara': [33, 44, 55], 'Peter': [66, 77]}

First aggregate the values of each group into a list, then convert the resulting Series to a dictionary:
d = df.groupby('Name')['Age'].agg(list).to_dict()
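For reference, here is a minimal, self-contained sketch of that approach using the data frame from the question. Note that groupby sorts the group keys by default, so the keys of d come out in alphabetical order; passing sort=False (shown with d_ordered, a name introduced here only for illustration) keeps them in order of first appearance.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'John', 'Sara', 'Sara', 'Sara', 'Peter', 'Peter'],
                   'Age': [11, 22, 33, 44, 55, 66, 77]})

# Collect the ages of each name into a list, then turn the Series into a dict.
d = df.groupby('Name')['Age'].agg(list).to_dict()

# sort=False keeps the keys in order of first appearance (John, Sara, Peter).
d_ordered = df.groupby('Name', sort=False)['Age'].agg(list).to_dict()
print(d_ordered)
```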

Related

How to save each column of a data frame into separate sheets in one excel file

I have the below pandas dataframe:
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)
How can I save each column as a separate sheet in one Excel file? For this example, the Excel file would consist of two sheets. I am looking for general code that can also be applied to other dataframes with many columns. I assume a for loop can resolve this issue.
You can do this:
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)
with pd.ExcelWriter('output.xlsx') as writer:
    for col in df.columns:
        df[col].to_excel(writer, sheet_name=col, index=False)
df.columns holds the column names, and each column name is used as the sheet name (sheet_name=col).

Python - Pandas - how to extract content of specific row without header column

I would like to get the content of a specific row without the header column. I tried df.iloc[row number], but it didn't give me the expected result.
my code as below:
import pandas as pd
df = pd.DataFrame({
'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
'age': [34, 29, 37, 52, 26, 32]},
)
df.head()
df_temp = df.loc[2]
The result I get is:
first_name Marry
last_name Jackson
age 37
Name: 2, dtype: object
I expected it to give me a list, something like the one below:
['Marry', 'Jackson','37']
Any idea to do this, could you please advise for my case?
Well, there are several ways to do this in pandas; to_string() and values are two of them.
So if you do something like
import pandas as pd
df = pd.DataFrame({
'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
'age': [34, 29, 37, 52, 26, 32]},
)
df.head()
df_temp = df.loc[2].to_string()
print(df_temp)
you will get an output like this for your given code:
first_name Marry
last_name Jackson
age 37
However, because you want a list-like result, you can just use values, which returns the row as a NumPy array without the column labels. Here's your updated code:
import pandas as pd
df = pd.DataFrame({
'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
'age': [34, 29, 37, 52, 26, 32]},
)
df.head()
df_temp = df.loc[2].values
print(df_temp)
which will give you the output you probably want as
['Marry' 'Jackson' 37]
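If a plain Python list is needed rather than a NumPy array, one option is to call tolist() on the row. The sketch below assumes the same df as in the question; the names row_list and row_as_str are chosen here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
    'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
    'age': [34, 29, 37, 52, 26, 32],
})

# Select the row by position and convert it to a Python list (no column labels).
row_list = df.iloc[2].tolist()
print(row_list)

# If every value should be a string, as in the expected output of the question:
row_as_str = [str(v) for v in df.iloc[2]]
print(row_as_str)
```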

How to split pandas dataframe into list of dataframes by id?

I have a big pandas dataframe (about 150000 rows). I have tried groupby('id'), but it returns group tuples. I need just a list of dataframes, which I then convert into np array batches to feed into an autoencoder (like this https://www.datacamp.com/community/tutorials/autoencoder-keras-tutorial but 1D).
So I have a pandas dataset :
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John', 'John', 'John', 'John', 'Krish'], 'Age': [20, 21, 19, 18, 18, 18, 18, 18],'id': [1, 1, 2, 2, 3, 3, 3, 3]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df.head(10)
I need the following output (just a list of pandas dataframes). Also, the lists must stay unsorted; this is important because the data is a time series.
data1 = {'Name': ['Tom', 'Joseph'], 'Age': [20, 21],'id': [1, 1]}
data2 = {'Name': ['Krish', 'John', ], 'Age': [19, 18, ],'id': [2, 2]}
data3 = {'Name': ['John', 'John', 'John', 'Krish'], 'Age': [18, 18, 18, 18],'id': [3, 3, 3, 3]}
pd_1 = pd.DataFrame(data1)
pd_2 = pd.DataFrame(data2)
pd_3 = pd.DataFrame(data3)
array_list = [pd_1,pd_2,pd_3]
array_list
How can I split the dataframe?
You can try:
array_list = df.groupby(df.id.values).agg(list).to_dict('records')
Output:
[{'Name': ['Tom', 'Joseph'], 'Age': [20, 21], 'id': [1, 1]},
{'Name': ['Krish', 'John'], 'Age': [19, 18], 'id': [2, 2]},
{'Name': ['John', 'John', 'John', 'Krish'],
'Age': [18, 18, 18, 18],
'id': [3, 3, 3, 3]}]
UPDATE:
If you need a dataframe list:
df_list = [g for _,g in df.groupby('id')]
#OR
df_list = [pd.DataFrame(i) for i in df.groupby(df.id.values).agg(list).to_dict('records')]
To reset the index of each dataframe:
df_list = [g.reset_index(drop=True) for _,g in df.groupby('id')]
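Since the question stresses that the original order matters, here is a hedged sketch that keeps the groups in order of first appearance with sort=False and then converts each piece to a NumPy array. The name array_batches is illustrative, and selecting only the numeric Age column is an assumption about what the autoencoder would consume.

```python
import pandas as pd

data = {'Name': ['Tom', 'Joseph', 'Krish', 'John', 'John', 'John', 'John', 'Krish'],
        'Age': [20, 21, 19, 18, 18, 18, 18, 18],
        'id': [1, 1, 2, 2, 3, 3, 3, 3]}
df = pd.DataFrame(data)

# One DataFrame per id; rows inside each group keep their original order,
# and sort=False keeps the groups in order of first appearance.
df_list = [g.reset_index(drop=True) for _, g in df.groupby('id', sort=False)]

# Convert each group to a NumPy array (assumption: only the numeric 'Age' column is needed).
array_batches = [g[['Age']].to_numpy() for g in df_list]

for batch in array_batches:
    print(batch.shape)
```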
Group on id (with sort=False to keep the original group order) and use to_dict with the 'list' orientation to build one record per id:
[g.to_dict('list') for _, g in df.groupby('id', sort=False)]
[{'Name': ['Tom', 'Joseph'], 'Age': [20, 21], 'id': [1, 1]},
{'Name': ['Krish', 'John'], 'Age': [19, 18], 'id': [2, 2]},
{'Name': ['John', 'John', 'John', 'Krish'], 'Age': [18, 18, 18, 18], 'id': [3, 3, 3, 3]}]
I am not sure about your need, but does something like this work for you? Selecting with a list, df.loc[[i]], returns a DataFrame even when an id has only one row.
df = df.set_index("id")
[df.loc[[i]].to_dict("list") for i in df.index.unique()]
Or, if you really want to keep the id column in each record (run this on the original df, before set_index):
[df.query(f"id == {i}").to_dict("list") for i in df.id.unique()]
If you want to create new DataFrames storing the values:
(Previous answers are more relevant if you want to create a list)
This can be solved by iterating over each id with a for loop and creating a new dataframe on every iteration.
I refer you to #40498463 and the other answers for the usage of the groupby() function. Please note that I have changed the name of the id column to Id.
for Id, group in df.groupby("Id"):
    str1 = "df"
    str2 = str(Id)
    new_name = str1 + str2
    # Bind the generated name (df1, df2, ...) to this group's rows.
    exec('{} = pd.DataFrame(group)'.format(new_name))
Output:
df1
     Name  Age  Id
0     Tom   20   1
1  Joseph   21   1
df2
    Name  Age  Id
2  Krish   19   2
3   John   18   2
df3
    Name  Age  Id
4   John   18   3
5   John   18   3
6   John   18   3
7  Krish   18   3
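As an alternative to generating variable names with exec, a dictionary keyed by id is a common pattern. The sketch below assumes the column is named Id, as in this answer; the name frames is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Joseph', 'Krish', 'John', 'John', 'John', 'John', 'Krish'],
                   'Age': [20, 21, 19, 18, 18, 18, 18, 18],
                   'Id': [1, 1, 2, 2, 3, 3, 3, 3]})

# Map each Id to its own DataFrame instead of creating df1, df2, df3 variables.
frames = {key: group for key, group in df.groupby('Id')}

# Look up a single group by its Id.
print(frames[2])
```

Each group keeps its original row labels, so frames[3] still carries the index values 4 to 7.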

Converting a data dictionary to numpy array in python

I'm trying to convert a data dictionary with a structure as below:
{'name': array(['Ben', 'Sean', 'Fred']),
'age': array([22, 16, 35]),
'marks': array([98, 75, 60]),
'result': array(['HD', 'D', 'C'])}
I then need to filter the dictionary so that only name, marks and result end up in the new NumPy array, which I want to plot on a graph. (I can do the plotting, but I can't for the life of me filter it and then convert it to NumPy.)
Let's assume your dictionary is something like this.
data = {
    'name': ['Ben', 'Sean', 'Fred'],
    'age': [22, 16, 35],
    'marks': [98, 75, 60],
    'result': ['HD', 'D', 'C']
}
You can iterate over the dictionary to collect the desired values in a list, then convert that list into a NumPy array. Here I am using all of the keys (name, age, marks, result), but you can filter out some keys if you like.
import numpy as np
data_list = []
for key, val in data.items():
    # To skip some keys, guard the append with: if key not in ['age']:
    data_list.append(val)
numpy_array = np.array(data_list)
transpose = numpy_array.T
transpose_list = transpose.tolist()
The end result will be the following (the numbers become strings because the array mixes text and integers):
[['Ben', '22', '98', 'HD'],
['Sean', '16', '75', 'D'],
['Fred', '35', '60', 'C']]
You can try pandas
import pandas as pd
d = {
'name': ['Ben','Sean','Fred'],
'age': [22, 16, 35],
'marks': [98, 75, 60],
'result': ['HD','D','C']
}
df = pd.DataFrame(d)
result = df[['name', 'marks', 'result']].T.values
print(type(result))
print(result)
<class 'numpy.ndarray'>
[['Ben' 'Sean' 'Fred']
[98 75 60]
['HD' 'D' 'C']]
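The same filtering can also be sketched with NumPy alone: pick the wanted keys, then arrange their values as columns. The names data, wanted, filtered and table are illustrative, and dtype=object is used here so that the marks stay integers instead of being converted to strings.

```python
import numpy as np

data = {
    'name': np.array(['Ben', 'Sean', 'Fred']),
    'age': np.array([22, 16, 35]),
    'marks': np.array([98, 75, 60]),
    'result': np.array(['HD', 'D', 'C']),
}

wanted = ['name', 'marks', 'result']

# Keep only the wanted keys.
filtered = {key: data[key] for key in wanted}

# Build one row per key with object dtype, then transpose:
# the result has one row per person and one column per wanted key.
table = np.array([filtered[key].astype(object) for key in wanted], dtype=object).T
print(table)
```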

Print the colname and rowname for values that meet certain condition

I am desperately trying to figure out how to print out the row index and column name for specific values in my df.
I have the following df:
raw_data = {'first_name': [NaN, 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, NaN, 24, 73],
'preTestScore': [4, 24, 31, 33, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age',
'preTestScore','postTestScore'])
I now want to print out the index and column name for the NaN:
There is a missing value in row 0 for first_name.
There is a missing value in row 2 for age.
I have searched a lot and always found how to do something for one row.
My idea is to first create a df with False and True
na = df.isnull()
Then I want to apply some function that prints the row number and col_name for every NaN value.
I just can't figure out how to do this.
Thanks in advance for any help!
I had to change the df a bit because NaN is not a defined name in Python; I replaced it with np.nan.
import numpy as np
import pandas as pd
raw_data = {'first_name': [np.nan, 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 52, np.nan, 24, 73],
            'preTestScore': [4, 24, 31, 33, 3],
            'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data)
Then you can do this:
dfs = df.stack(dropna = False)
[f'There is a missing value in row {i[0]} for {i[1]}' for i in dfs[dfs.isna()].index]
This produces a list:
['There is a missing value in row 0 for first_name',
'There is a missing value in row 2 for age']
As simple as:
np.where(df.isnull())
It returns a tuple of two arrays: the row positions and the column positions of the missing values, respectively.
Example:
na_idx = np.where(df.isnull())
for i, j in zip(*na_idx):
    print(f'Row {i} and column {j} ({df.columns[j]}) is NA.')
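Putting this together with the data from the question, a minimal sketch that produces the requested sentences might look like the following. The name messages is illustrative; the row label is looked up through df.index so the text stays correct when the index is not the default 0, 1, 2, ...

```python
import numpy as np
import pandas as pd

raw_data = {'first_name': [np.nan, 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 52, np.nan, 24, 73],
            'preTestScore': [4, 24, 31, 33, 3],
            'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data)

# Row and column positions of every missing value.
rows, cols = np.where(df.isnull().to_numpy())

# Translate positions back into the row label and the column name.
messages = [
    f'There is a missing value in row {df.index[i]} for {df.columns[j]}.'
    for i, j in zip(rows, cols)
]
print('\n'.join(messages))
```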
You could do something like the below:
for i, row in df.iterrows():
    nans = row[row.isna()].index
    for n in nans:
        print('row: %s, col: %s' % (i, n))
I think melting is the way to go.
I'd start by creating a dataframe with columns: index, column_name, value.
Then filter column value by not null.
And dump the result to dict.
df = pd.melt(df.reset_index(), id_vars=['index'], value_vars=df.columns)
selected = df[df['value'].isnull()].drop('value', axis=1).set_index('index')
resp = selected.T.to_dict(orient='records')[0]
s = "There is a missing value in row {idx} for {col_name}."
for record in resp.items():
    idx, col_name = record
    print(s.format(idx=idx, col_name=col_name))
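You can just create a variable
NaN = "null"
to indicate an empty cell. Note that this stores the string "null" rather than a real missing value, so df.isnull() will not flag these cells.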
import pandas as pd
NaN = "null"
raw_data = {'first_name': [NaN, 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, NaN, 24, 73],
'preTestScore': [4, 24, 31, 33, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age',
'preTestScore','postTestScore'])
print(df)
output:
  first_name last_name   age  preTestScore  postTestScore
0       null    Miller    42             4             25
1      Molly  Jacobson    52            24             94
2       Tina       Ali  null            31             57
3       Jake    Milner    24            33             62
4        Amy     Cooze    73             3             70
