I am desperately trying to figure out how to print the row index and column name for specific values in my df.
I have the following df:
raw_data = {'first_name': [NaN, 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, NaN, 24, 73],
'preTestScore': [4, 24, 31, 33, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age',
'preTestScore','postTestScore'])
I now want to print out the index and column name for the NaN:
There is a missing value in row 0 for first_name.
There is a missing value in row 2 for age.
I have searched a lot and only ever found how to do something for a single row.
My idea is to first create a df of True and False values:
na = df.isnull()
Then I want to apply some function that prints the row number and col_name for every NaN value.
I just can't figure out how to do this.
Thanks in advance for any help!
I had to change the df a bit because of NaN and replaced it with np.nan:
import numpy as np
import pandas as pd
raw_data = {'first_name': [np.nan, 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, np.nan, 24, 73],
'preTestScore': [4, 24, 31, 33, 3],
'postTestScore': [25, 94, 57, 62, 70]}
You can do this:
dfs = df.stack(dropna=False)
[f'There is a missing value in row {i[0]} for {i[1]}' for i in dfs[dfs.isna()].index]
This prints a list:
['There is a missing value in row 0 for first_name',
'There is a missing value in row 2 for age']
As simple as:
np.where(df.isnull())
It returns a tuple with the row indexes and the column indexes of the NAs, respectively.
Example:
na_idx = np.where(df.isnull())
for i, j in zip(*na_idx):
    print(f'Row {i} and column {j} ({df.columns[j]}) is NA.')
You could do something like the below:
for i, row in df.iterrows():
    nans = row[row.isna()].index
    for n in nans:
        print('row: %s, col: %s' % (i, n))
I think melting is the way to go.
I'd start by creating a dataframe with columns: index, column_name, value.
Then filter the value column for nulls.
And dump the result to dict.
df = pd.melt(df.reset_index(), id_vars=['index'], value_vars=df.columns)
selected = df[df['value'].isnull()].drop('value', axis=1).set_index('index')
resp = selected.T.to_dict(orient='records')[0]
s = "There is a missing value in row {idx} for {col_name}."
for record in resp.items():
    idx, col_name = record
    print(s.format(idx=idx, col_name=col_name))
You can just create a variable
NaN = "null"
to indicate an empty cell:
import pandas as pd
NaN = "null"
raw_data = {'first_name': [NaN, 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, NaN, 24, 73],
'preTestScore': [4, 24, 31, 33, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age',
'preTestScore','postTestScore'])
print(df)
output:
first_name last_name age preTestScore postTestScore
0 null Miller 42 4 25
1 Molly Jacobson 52 24 94
2 Tina Ali null 31 57
3 Jake Milner 24 33 62
4 Amy Cooze 73 3 70
Related
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'John', 'Sara', 'Sara', 'Sara', 'Peter', 'Peter'],
'Age': [11, 22, 33, 44, 55, 66, 77]})
Assume that I have a data frame which is given above. My goal is to convert this data frame to the following dictionary format below. Does anybody know a convenient way to solve this problem? Thanks in advance.
# Expected Output:
out_df = {'John':[11, 22], 'Sara': [33, 44, 55], 'Peter': [66, 77]}
First aggregate lists, then convert the Series to a dictionary:
d = df.groupby('Name')['Age'].agg(list).to_dict()
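For reference, here is the whole thing as a runnable sketch. Note that groupby sorts the group keys by default, so the dict key order may differ from the input order:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'John', 'Sara', 'Sara', 'Sara', 'Peter', 'Peter'],
                   'Age': [11, 22, 33, 44, 55, 66, 77]})

# Collect each name's ages into a list, then turn the resulting Series into a dict
d = df.groupby('Name')['Age'].agg(list).to_dict()
print(d)  # {'John': [11, 22], 'Peter': [66, 77], 'Sara': [33, 44, 55]}
```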
I would like to get the content of a specific row without the header column. I tried to use df.iloc[row number], but it didn't give me the expected result.
my code as below:
import pandas as pd
df = pd.DataFrame({
'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
'age': [34, 29, 37, 52, 26, 32]},
)
df.head()
df_temp = df.loc[2]
The result i get is:
first_name Marry
last_name Jackson
age 37
Name: 2, dtype: object
I expected it to give me a list, something like below:
['Marry', 'Jackson','37']
Any idea how to do this? Could you please advise on my case?
Well, there are many functions in pandas that could help you do this; to_string() and values are a couple of them.
So if you do something like
import pandas as pd
df = pd.DataFrame({
'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
'age': [34, 29, 37, 52, 26, 32]},
)
df.head()
df_temp = df.loc[2].to_string()
print(df_temp)
you will get an output like this for your given code:
first_name Marry
last_name Jackson
age 37
However, in your case, because you want a list, you can just call values. Here's your updated code:
import pandas as pd
df = pd.DataFrame({
'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
'age': [34, 29, 37, 52, 26, 32]},
)
df.head()
df_temp = df.loc[2].values
print(df_temp)
which will give you the output you probably want:
['Marry' 'Jackson' 37]
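Note that values returns a NumPy array; if you want an actual Python list, as in the expected output, .tolist() should give you that. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
    'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
    'age': [34, 29, 37, 52, 26, 32]})

# .tolist() converts the row's values into a plain Python list
row_list = df.loc[2].tolist()
print(row_list)  # ['Marry', 'Jackson', 37]
```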
I have a big pandas dataframe (about 150000 rows). I have tried the groupby('id') method, but it returns group tuples. I need just a list of dataframes, which I then convert into np array batches to put into an autoencoder (like this https://www.datacamp.com/community/tutorials/autoencoder-keras-tutorial but 1D).
So I have a pandas dataset :
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John', 'John', 'John', 'John', 'Krish'], 'Age': [20, 21, 19, 18, 18, 18, 18, 18],'id': [1, 1, 2, 2, 3, 3, 3, 3]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df.head(10)
I need the same output (just a list of pandas dataframes). Also, the lists must stay unsorted; this is important because it is time series data.
data1 = {'Name': ['Tom', 'Joseph'], 'Age': [20, 21],'id': [1, 1]}
data2 = {'Name': ['Krish', 'John', ], 'Age': [19, 18, ],'id': [2, 2]}
data3 = {'Name': ['John', 'John', 'John', 'Krish'], 'Age': [18, 18, 18, 18],'id': [3, 3, 3, 3]}
pd_1 = pd.DataFrame(data1)
pd_2 = pd.DataFrame(data2)
pd_3 = pd.DataFrame(data3)
array_list = [pd_1,pd_2,pd_3]
array_list
How can I split the dataframe?
Or you can try:
array_list = df.groupby(df.id.values).agg(list).to_dict('records')
Output:
[{'Name': ['Tom', 'Joseph'], 'Age': [20, 21], 'id': [1, 1]},
{'Name': ['Krish', 'John'], 'Age': [19, 18], 'id': [2, 2]},
{'Name': ['John', 'John', 'John', 'Krish'],
'Age': [18, 18, 18, 18],
'id': [3, 3, 3, 3]}]
UPDATE:
If you need a dataframe list:
df_list = [g for _,g in df.groupby('id')]
#OR
df_list = [pd.DataFrame(i) for i in df.groupby(df.id.values).agg(list).to_dict('records')]
To reset the index of each dataframe:
df_list = [g.reset_index(drop=True) for _,g in df.groupby('id')]
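Since the stated goal is np array batches for the autoencoder, the list of frames can be turned into arrays in the same style. A sketch, assuming only the numeric Age column goes into the model; sort=False keeps the groups in their order of first appearance, which matters for time series:

```python
import pandas as pd

data = {'Name': ['Tom', 'Joseph', 'Krish', 'John', 'John', 'John', 'John', 'Krish'],
        'Age': [20, 21, 19, 18, 18, 18, 18, 18],
        'id': [1, 1, 2, 2, 3, 3, 3, 3]}
df = pd.DataFrame(data)

# sort=False preserves the original order of the groups
df_list = [g.reset_index(drop=True) for _, g in df.groupby('id', sort=False)]

# One NumPy array per group, here using only the numeric 'Age' column
batches = [g['Age'].to_numpy() for g in df_list]
print([len(b) for b in batches])  # [2, 2, 4]
```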
Let us group on id and, using to_dict with orientation 'list', prepare records per id:
[g.to_dict('list') for _, g in df.groupby('id', sort=False)]
[{'Name': ['Tom', 'Joseph'], 'Age': [20, 21], 'id': [1, 1]},
{'Name': ['Krish', 'John'], 'Age': [19, 18], 'id': [2, 2]},
{'Name': ['John', 'John', 'John', 'Krish'], 'Age': [18, 18, 18, 18], 'id': [3, 3, 3, 3]}]
I am not sure about your needs, but does something like this work for you?
df = df.set_index("id")
[df.loc[i].to_dict("list") for i in df.index.unique()]
or if you really want to keep your index in your list:
[df.query(f"id == {i}").to_dict("list") for i in df.id.unique()]
If you want to create new DataFrames storing the values:
(Previous answers are more relevant if you want to create a list)
This can be solved by iterating over each id with a for loop and creating a new dataframe on every iteration.
I refer you to #40498463 and the other answers for the usage of the groupby() function. Please note that I have changed the name of the id column to Id.
for Id, df in df.groupby("Id"):
    str1 = "df"
    str2 = str(Id)
    new_name = str1 + str2
    exec('{} = pd.DataFrame(df)'.format(new_name))
Output:
df1
Name Age Id
0 Tom 20 1
1 Joseph 21 1
df2
Name Age Id
2 Krish 19 2
3 John 18 2
df3
Name Age Id
4 John 18 3
5 John 18 3
6 John 18 3
7 Krish 18 3
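As a side note, storing the sub-frames in a dict keyed by Id gives the same result without exec and without dynamically created variable names. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Joseph', 'Krish', 'John'],
                   'Age': [20, 21, 19, 18],
                   'Id': [1, 1, 2, 2]})

# Map each Id to its own sub-dataframe instead of creating df1, df2, ... via exec
frames = {Id: g for Id, g in df.groupby('Id')}
print(frames[1]['Name'].tolist())  # ['Tom', 'Joseph']
```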
I have a pandas dataset, and I was wondering if I can include it in a dictionary to export it as a pickle together with other stuff.
i.e.
import pandas as pd
import pickle
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
dict_ = {"other_stuff": "blabla", "pandas": df}
pickle.dump(dict, "shared.pkl")
When I open it, using:
fp = open("shared.pkl",'rb')
shared = pickle.load(fp)
df= shared["pandas"]
the pandas dataframe is empty. Any idea if this is even possible, or how to do it?
EDIT:
I know that I can simply pickle the pandas object itself df.to_pickle("shared.pkl"), but I am interested in saving the other stuff together with the pandas document in one convenient pickle file.
You can save the dict with
with open('shared.pkl', 'wb') as f:
    pickle.dump(dict_, f)
and then open it with
with open('shared.pkl', 'rb') as f:
    dict_ = pickle.load(f)
The pickle.dump command expects a file object, not the name of the file.
from io import BytesIO
outfile = BytesIO()
pickle.dump(dict_, outfile)
outfile.seek(0)
unpickled_dict = pickle.load(outfile)
unpickled_dict['pandas'].info()
will give you the expected output. The dataframe with plain datatypes should pickle just fine.
You can save it in a list or in a dict and then pickle it as normal.
Also, don't use built-in names like dict or list as variable names; it's bad style and leads to unexpected behavior.
import pandas as pd
import pickle
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
data = {"other_stuff": "blabla", "pandas": df}
file_write = open(b"resources/shared.pkl","wb")
pickle.dump(data, file_write)
file_write.close()
file_read = open(b"resources/shared.pkl","rb")
shared = pickle.load(file_read)
file_read.close()
print(shared["other_stuff"])
print(shared["pandas"])
Output
blabla
first_name last_name age preTestScore postTestScore
0 Jason Miller 42 4 25
1 Molly Jacobson 52 24 94
2 Tina Ali 36 31 57
3 Jake Milner 24 2 62
4 Amy Cooze 73 3 70
I'm looking to fill in missing values of one column with the mode of the value from another column. Let's say this is our data set (borrowed from Chris Albon):
import pandas as pd
import numpy as np
raw_data = {'first_name': ['Jake', 'Jake', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Smith', 'Ali', 'Milner', 'Cooze'],
'age': [42, np.nan, 36, 24, 73],
'sex': ['m', np.nan, 'f', 'm', 'f'],
'preTestScore': [4, np.nan, np.nan, 2, 3],
'postTestScore': [25, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
df
I know we can fill in missing postTestScore with each sex's mean value of postTestScore with:
df["postTestScore"].fillna(df.groupby("sex")["postTestScore"].transform("mean"), inplace=True)
df
But how would we fill in missing sex with each first name's mode value of sex (obviously this is not politically correct, but as an example this was an easy data set to use)? So for this example the missing sex value would be 'm', because there are two Jakes with the value 'm'. If there were a Jake with the value 'f', it would still pick 'm' as the mode value because 2 > 1. It would be nice if you could do:
df["sex"].fillna(df.groupby("first_name")["sex"].transform("mode"), inplace=True)
df
I looked into value_counts and apply but couldn't find this specific case. My ultimate goal is to be able to look at one column and if that doesn't have a mode value then to look at another column for a mode value.
You need to call the mode function with pd.Series.mode:
df.groupby("first_name")["sex"].transform(pd.Series.mode)
Out[432]:
0 m
1 m
2 f
3 m
4 f
Name: sex, dtype: object
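To plug this back into the fillna from the question, one option is a sketch like the following. It takes the first mode per group with .iloc[0], since mode() can return several values (and note that a group that is entirely NaN would raise, so this assumes every group has at least one known value):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'first_name': ['Jake', 'Jake', 'Tina', 'Jake', 'Amy'],
                   'sex': ['m', np.nan, 'f', 'm', 'f']})

# Take the first mode within each first_name group, then fill the gaps with it
mode_by_name = df.groupby('first_name')['sex'].transform(lambda s: s.mode().iloc[0])
df['sex'] = df['sex'].fillna(mode_by_name)
print(df['sex'].tolist())  # ['m', 'm', 'f', 'm', 'f']
```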