Dictionary of Dataframes select name of the dataframe - python

I am using Python.
I have a dictionary of Dataframes. Each dataframe has a name in the dictionary and I can reference it correctly no problem.
I am trying to take that name and add it as a column across every row. I am having a rough time doing this.

You can simply assign the name string to a new column for each DataFrame:
import pandas as pd

frames = {
    'foo': pd.DataFrame({'a': [1, 2], 'b': [3, 4]}),
    'bar': pd.DataFrame({'a': [9, 8], 'b': [7, 6]})
}
for name, df in frames.items():
    df['name'] = name
    print(df, '\n')
Gives:
   a  b name
0  1  3  foo
1  2  4  foo

   a  b name
0  9  7  bar
1  8  6  bar

You can iterate through your dictionary and do the following:
for key in d.keys():  # d is the dictionary of dataframes
    d[key]['new_col'] = key  # key is the name string added to every row of that dataframe

You can use a dictionary comprehension and assign:
frames = {k: v.assign(name=k) for k, v in frames.items()}
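If the goal of tagging each row is to stack the frames afterwards, the same `assign` idea can feed `pd.concat`. This is only a sketch under that assumption; the variable name `combined` is made up for illustration, and the sample data is the one from the first answer.

```python
import pandas as pd

frames = {
    "foo": pd.DataFrame({"a": [1, 2], "b": [3, 4]}),
    "bar": pd.DataFrame({"a": [9, 8], "b": [7, 6]}),
}

# assign() returns a new DataFrame, so the originals in `frames` stay untouched.
tagged = [df.assign(name=key) for key, df in frames.items()]

# Stack the tagged frames into one table with a fresh 0..n-1 index.
combined = pd.concat(tagged, ignore_index=True)
print(combined)
```

Unlike `df['name'] = name`, this leaves the dataframes stored in the dictionary without the extra column.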

Related

python pandas function df pointer doesn't change values

I'm trying to pass a function a pointer to an existing df and copy values from one df to another, but after the function has finished, the values are not assigned to the original object.
How to reproduce:
import pandas as pd

def copy(df, new_df):
    new_df = df.copy()
    # an example of things that would be modified on new_df
    new_df[0] = "test"
    # just editing df as an example, to show that df is being changed while new_df is not receiving the values
    df[0] = "test"

if __name__ == '__main__':
    mat = [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ]
    df = pd.DataFrame(mat)
    new_df = pd.DataFrame()
    copy(df, new_df)
    print(new_df)
As you can see, I am assigning "test" to the first column. The assignment to df does change the original object that was passed in, but new_df does not get the new values.
Is this a bug in pandas, or am I doing something wrong?
Edit:
Assigning values to df[0] is just an example of how the values do change on the original df.
My question is: how do I assign the values from the original df to a new df (it could also be concat, not only copy) without having to return the df and create a new variable that receives the returned value from the function?
(Scroll down to Edit section to find the answer)
In your case,
new_df = df.copy() rebinds the name new_df inside the function to a brand-new object, and that copy is never returned. So the new_df outside the function is never assigned the new values.
The "test" assignment therefore lands on the local copy (and on df, which is modified in place), not on the caller's new_df. That's why the changes are not reflected when you print new_df.
Try this out.
import pandas as pd

def copy(df, new_df):
    new_df = df.copy()
    new_df[0] = "test"
    return new_df

if __name__ == "__main__":
    mat = [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ]
    df = pd.DataFrame(mat)
    new_df = pd.DataFrame()
    new_df = copy(df, new_df)
    print(new_df)
output
      0  1  2
0  test  2  3
1  test  5  6
2  test  8  9
Edit
Sorry, I completely missed the part where you needed to have two linked DataFrames. You could just assign the df to a new variable without copying it.
Try this out:
import pandas as pd

def copy(df):
    new_df = df
    df[0] = "test val"
    return df, new_df

if __name__ == "__main__":
    mat = [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ]
    df = pd.DataFrame(mat)
    df, new_df = copy(df)
    print(df)
    print(new_df)
output:
          0  1  2
0  test val  2  3
1  test val  5  6
2  test val  8  9
          0  1  2
0  test val  2  3
1  test val  5  6
2  test val  8  9
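For the "without returning anything" part of the question, one possible sketch is to mutate the DataFrame that was passed in rather than rebinding the parameter name. The helper name `fill_from` is hypothetical, and this assumes the target starts out empty, as in the question.

```python
import pandas as pd

def fill_from(df, target):
    """Copy every column of df into target by mutating target in place."""
    for col in df.columns:
        # Item assignment changes the object the caller passed in;
        # `target = df.copy()` would only rebind the local name.
        target[col] = df[col]
    target[0] = "test"  # this change is visible to the caller as well

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
new_df = pd.DataFrame()
fill_from(df, new_df)
print(new_df)
```

Here df keeps its original values, because only target is written to. Returning the new DataFrame, as in the first example above, is still the more conventional pattern.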

How to get rows from only some columns based on columns value in Pandas?

I use Pandas to get data from Excel. From those tables, I often need to find one or several values in a single row, based on the value in a column.
I've read a lot about Pandas (doc and SO), and almost every time, the question is like « how to SELECT * FROM df WHERE value = smthing ».
But what I'd like to do is more like:
SELECT Col1, Col2
FROM df
WHERE Col3.value = smthing
And I can't find any answer.
For example :
>>> dataFrame
   foo  bar  sm_else
0    0    3        6
1    1    4        7
2    2    5        8
I want to get foo value and sm_else value when bar == 4.
So :
foo  sm_else
  1        7
Result can be DataFrame or can be list or dict, I don't really care.
Thanks !
How can I achieve this ?
df.loc can help you out
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2, 3],
                        'col2': [4, 5, 6],
                        'col3': [7, 8, 9]})
print(df.loc[df['col2'] == 4][['col1', 'col2']])
df.loc[df.bar == 4, ['foo', 'sm_else']]
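Applied to the sample table from the question, the `df.loc[row_mask, column_list]` form might look like the sketch below. The variable names `selected` and `as_records` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"foo": [0, 1, 2], "bar": [3, 4, 5], "sm_else": [6, 7, 8]})

# First argument: boolean mask choosing the rows (WHERE bar = 4).
# Second argument: list of column labels to keep (SELECT foo, sm_else).
selected = df.loc[df["bar"] == 4, ["foo", "sm_else"]]
print(selected)

# If a plain Python structure is preferred over a DataFrame:
as_records = selected.to_dict("records")
print(as_records)
```

Passing the mask and the column list in a single `.loc` call avoids the chained indexing used in the first answer.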

Creating Pandas DataFrame from list or dict always returns empty DF

I'm trying to create a pandas dataframe out of a dictionary. The dictionary keys are strings and the values are 1 or more lists. I'm having a strange issue in which pd.DataFrame() command consistently returns an empty dataframe even when I pass it a non-empty object like a list or dict.
My code is similar to the following:
myDictionary = {"ID1":[1,2,3], "ID2":[10,11,12],[2,34,11],"ID3":[8,3,12]}
df = pd.DataFrame(myDictionary, columns = ["A","B","C"])
So I want to create a DF that looks like this:
      A   B   C
ID1   1   2   3
ID2  10  11  12
ID2   2  34  11
ID3   8   3  12
When I check the contents of df, I get "Empty DataFrame", and if I iterate over its contents, I get just the column names and none of the data in myDictionary! I have checked the documentation and this should be a straightforward command:
pd.DataFrame(dict, columns)
This doesn't get me the result I'm looking for and I'm baffled why. Anyone have any ideas? Thank you!
What I would recommend doing in this situation is interpreting your list of lists as strings. Later if you need to edit or analyze any of these you can use a parser to interpret the columns.
See below working code that allows you to keep your list of lists in the dataframe.
myDictionary = {"ID1":'[1,2,3]', "ID2":'[10,11,12],[2,34,11]',"ID3":'[8,3,12]'}
df = pd.DataFrame(myDictionary, columns = ["ID1","ID2","ID3"], index = [0])
df.rename(columns ={'ID1' : 'A', 'ID2': 'B', 'ID3': 'C'}, inplace = True)
df.head(3)
By always converting the lists to strings you will be able to combine them much easier, regardless of how many lists there are that need to be combined.
Try the example below to figure out why df is empty:
myDictionary = {"ID1":[1,2,3], "ID2":[10,11,12],"ID3":[8,3,12], 'A':[0, 0, 0]}
df = pd.DataFrame(myDictionary, columns = ["A","B","C"])
and what you want is:
myDictionary = {"ID1":[1,2,3], "ID2":[10,11,12],"ID3":[8,3,12]}
df = pd.DataFrame(myDictionary).rename(columns={'ID1':'A', 'ID2':'B', 'ID3':'C'})
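To make the point of that experiment concrete: when the data is a dict, `columns` picks keys out of the dict instead of renaming them. A small sketch, where the variable names `empty` and `partial` are made up for illustration:

```python
import pandas as pd

data = {"ID1": [1, 2, 3], "ID2": [10, 11, 12], "ID3": [8, 3, 12]}

# None of the requested labels is a key of `data`, so nothing is selected.
empty = pd.DataFrame(data, columns=["A", "B", "C"])
print(empty.empty)

# Once "A" exists as a key, that column is filled in;
# "B" and "C" are still missing and come back as NaN.
partial = pd.DataFrame({**data, "A": [0, 0, 0]}, columns=["A", "B", "C"])
print(partial)
```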
You are passing in the names "ID1", "ID2", and "ID3" into pd.DataFrame as the column names and then telling pandas to use columns A, B, C. Since there are no columns A, B, C pandas returns an empty DataFrame. Use the code below to make the DataFrame:
import pandas as pd
myDictionary = {"ID1": [1, 2, 3], "ID2": [10, 11, 12], "ID3": [8, 3, 12]}
df = pd.DataFrame(myDictionary, columns=["ID1", "ID2", "ID3"])
print(df)
Output:
   ID1  ID2  ID3
0    1   10    8
1    2   11    3
2    3   12   12
And moreover this:
"ID2":[10,11,12],[2,34,11]
Is incorrect, since you are either trying to give one key two values or forgot to add a key for the values [2,34,11]. As written, Python should raise a SyntaxError for this dictionary literal unless you remove that list or give it a key.
Firstly the [2,34,11] list is missing a column name. GIVE IT A NAME!
The reason for your error is that when you use the following command:
df = pd.DataFrame(myDictionary, columns = ["A","B","C"])
It creates a dataframe based on your dictionary. But then you are saying that you only want columns from your dictionary that are labelled 'A', 'B', 'C', which your dictionary doesn't have.
Try instead:
df = pd.DataFrame(myDictionary, columns = ["ID1","ID2","ID3"])
df.rename(columns ={'ID1' : 'A', 'ID2': 'B', 'ID3': 'C'}, inplace = True)
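Since the layout asked for in the question has the IDs as row labels and A, B, C as columns, another option worth considering is `pd.DataFrame.from_dict` with `orient="index"`. This is a sketch that assumes the stray `[2,34,11]` list has been removed, as in the answers above; the name `my_dictionary` is used only for this example.

```python
import pandas as pd

my_dictionary = {"ID1": [1, 2, 3], "ID2": [10, 11, 12], "ID3": [8, 3, 12]}

# orient="index" turns each key into a row label and each list into a row;
# `columns` then names the positions inside the lists.
df = pd.DataFrame.from_dict(my_dictionary, orient="index", columns=["A", "B", "C"])
print(df)
```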
Two rows with the same label, as in your example
ID2  10  11  12
ID2   2  34  11
cannot come straight from a dictionary, because every key in a dictionary has to be unique. The dataframe you describe would need a dictionary like the one below, where the second ID2 entry simply overwrites the first:
{"ID2":[10,11,12],"ID2":[2,34,11]}
So my suggestion is to change your dictionary design and then follow one of the many answers about converting a dictionary to a df.
Here is one possible approach
Dictionary
myDictionary = {"ID1":[1,2,3], "ID2":[[10,11,12],[2,34,11]],"ID3":[8,3,12]}
First, build a dictionary d that holds only the entries whose values are nested lists. Its keys must be unique, so a suffix is added to each one, and its values are the individual sub-lists taken from the nested list.
To do this, iterate through the dictionary and
  check if the value contains a sub-list
  if so, add one key:value pair per sub-list to a separate dictionary d
  use a suffix to separate identical keys, since the key ID2 can't be repeated in a dictionary
  each suffixed key will hold one of the sub-lists from the nested list
  collect the keys of myDictionary whose values are nested lists in a list named nested_keys
d = {}
nested_keys = []
for k,v in myDictionary.items():
    if any(isinstance(i, list) for i in v):
        for m,s in enumerate(v):
            d[k+'_'+str(m+1)] = s
        nested_keys.append(k)
print(d)
{'ID2_1': [10, 11, 12], 'ID2_2': [2, 34, 11]}
(Using the list of keys whose values are nested lists - nested_keys) Get a second dictionary that contains values that are not nested lists - see this SO post for how to do this
myDictionary = {key: myDictionary[key] for key in myDictionary if key not in nested_keys}
print(myDictionary)
{'ID1': [1, 2, 3], 'ID3': [8, 3, 12]}
Combine the 2 dictionaries above into a single dictionary
myDictionary = {**d, **myDictionary}
print(myDictionary)
{'ID2_1': [10, 11, 12], 'ID2_2': [2, 34, 11], 'ID1': [1, 2, 3], 'ID3': [8, 3, 12]}
Convert the combined dictionary into a DataFrame and drop the suffix that was added earlier
df = pd.DataFrame(list(myDictionary.values()), index=myDictionary.keys(),
                  columns=list('ABC'))
df.reset_index(inplace=True)
df = df.replace(r"_[0-9]", "", regex=True)
df.sort_values(by='index', inplace=True)
print(df)
  index   A   B   C
2   ID1   1   2   3
0   ID2  10  11  12
1   ID2   2  34  11
3   ID3   8   3  12
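A shorter variant of the same idea, offered only as a sketch: collect the rows and their labels in two lists and let the row index carry the repeated label, which pandas permits. It assumes, as above, that a value is either a flat list or a list of lists; the names `labels` and `rows` are made up for illustration.

```python
import pandas as pd

my_dictionary = {"ID1": [1, 2, 3], "ID2": [[10, 11, 12], [2, 34, 11]], "ID3": [8, 3, 12]}

labels, rows = [], []
for key, value in my_dictionary.items():
    # Treat a flat list as a single row and a nested list as several rows.
    sublists = value if any(isinstance(item, list) for item in value) else [value]
    for sub in sublists:
        labels.append(key)
        rows.append(sub)

# The same label may appear more than once in a DataFrame index.
df = pd.DataFrame(rows, index=labels, columns=list("ABC"))
print(df)
```

This skips the suffix-and-strip step, at the cost of having duplicate index labels, so `df.loc["ID2"]` returns two rows.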

Creating a dataframe in a for loop based on another dataframe

I have a data frame, df, and I'd like to get all the columns in it and the count of unique values in it and save it as another data frame. I can't seem to find a way to do that. I can, however, print what I want on the console. Here's what I mean:
def counting_unique_values_in_df(df):
    for evry_colm in df:
        print (evry_colm, "-", df[evry_colm].value_counts().count())
Now that prints what I want just fine. Instead of printing, if I do something like newdf = pd.DataFrame(evry_colm, df[evry_colm].value_counts().count(), columns = ('a', 'b')), it throws an error that reads "TypeError: object of type 'numpy.int32' has no len()". Obviously, that isn't right.
Soo, how can I make a data frame like columnName and UniqueCounts?
To count the unique values in each column, you can use apply together with the nunique function on the data frame.
Something like:
import pandas as pd

df = pd.DataFrame([
    {'a': 1, 'b': 2},
    {'a': 2, 'b': 2}
])
count_series = df.apply(lambda col: col.nunique())
# returned object is pandas Series
# a 2
# b 1
# to map it to DataFrame try
pd.DataFrame(count_series).T
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})
print(df)
print()
df = pd.DataFrame({col: [df[col].nunique()] for col in df})
print(df)
Output:
   A  B
0  1  1
1  1  2
2  2  3
3  2  4

   A  B
0  2  4
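If the result should have one row per column, with the headers columnName and UniqueCounts mentioned in the question, `DataFrame.nunique()` can be reshaped directly. A minimal sketch using the same sample data as above; the variable name `summary` is made up.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 3, 4]})

summary = (
    df.nunique()                           # Series: index = column names, values = unique counts
      .rename_axis("columnName")           # name the index so it becomes a labelled column
      .reset_index(name="UniqueCounts")    # turn the Series into a two-column DataFrame
)
print(summary)
```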

how to add columns label on a Pandas DataFrame

I can't understand how can I add column names on a pandas dataframe, an easy example will clarify my issue:
dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]}
df = pd.DataFrame(dic)
now if I type df then I get
   a  b  c
0  4  4  5
1  1  2  7
2  3  1  9
3  1  4  1
say now that I generate another dataframe just by summing up the columns on the previous one
a = df.sum()
if I type 'a' then I get
a     9
b    11
c    22
That looks like a dataframe with an index but without a name on its only column. So I wrote
a.columns = ['column']
or
a.columns = ['index', 'column']
and in both cases Python was happy, because it didn't give me any error message. But still, if I type 'a' I can't see the column names anywhere. What's wrong here?
The method DataFrame.sum() does an aggregation and therefore returns a Series, not a DataFrame. And a Series has no columns, only an index. If you want to create a DataFrame out of your sum, you can replace a = df.sum() with:
a = pd.DataFrame(df.sum(), columns = ['whatever_name_you_want'])
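An equivalent route is `Series.to_frame`, which names the single column in one step. The sketch below reuses the data from the question; the column name `total` and the variable `with_labels` are arbitrary choices for illustration.

```python
import pandas as pd

dic = {"a": [4, 1, 3, 1], "b": [4, 2, 1, 4], "c": [5, 7, 9, 1]}
df = pd.DataFrame(dic)

sums = df.sum()                    # a Series indexed by the old column names
a = sums.to_frame(name="total")    # a one-column DataFrame with a real header
print(a)

# To also turn the old column names into a regular column:
with_labels = a.rename_axis("column").reset_index()
print(with_labels)
```

Setting `a.columns` on the Series raised no error in the question only because pandas stored it as a plain attribute; it does not change how the Series is displayed.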
