This seems to be a common question, but mine is slightly different. Most of the questions I have found on SO explain how to create differently named variables iteratively. I want to iterate over variables that already exist, say in a dictionary.
Consider pandas: let us say I define three dataframes df_E1, df_E2, df_E3, each with columns like name, number, description.
Say I want to print describe() for all of them: df_E1.describe(), df_E2.describe(), df_E3.describe(). Instead of printing them one by one, what if I have 20-30 such dataframes and want to call df_E{number}.describe() in a for loop? How can I do that?
import pandas as pd

df_E1 = {'name':['john','tom','barney','freddie'], 'number':['3','4','5','6'], 'description':['a','b','c','d']}
df_E2 = {'name':['andy','candy','bruno','mars'], 'number':['1','2','5','8'], 'description':['g','h','j','k']}
df_E3 = {'name':['donald','trump','harry','thomas'], 'number':['9','4','5','7'], 'description':['c','g','j','r']}
df_E1 = pd.DataFrame(df_E1)
df_E2 = pd.DataFrame(df_E2)
df_E3 = pd.DataFrame(df_E3)
print(df_E1.head())
print(df_E2.head())
print(df_E3.head())
#### Instead of the three statements above, can I print them in a for loop?
#### The below does not work: it just prints the strings.
for i in range(1, 4):
    print('df_E' + str(i))
### If I can access the dataframes somehow, I will be able to use them all
### in a loop, e.g. to print their describe() output:
for i in range(1, 4):
    print(('df_E' + str(i)).describe())  # something like this, but working
This isn't a duplicate of the questions already asked: those focus on creating variables from strings in a for loop, which I know can be done with a dictionary. Here, the requirement is to use variables that already exist.
The best approach here is to use a dict:
d = {'df_E1':df_E1, 'df_E2':df_E2, 'df_E3':df_E3}
print (d)
{'df_E1': name number description
0 john 3 a
1 tom 4 b
2 barney 5 c
3 freddie 6 d, 'df_E2': name number description
0 andy 1 g
1 candy 2 h
2 bruno 5 j
3 mars 8 k, 'df_E3': name number description
0 donald 9 c
1 trump 4 g
2 harry 5 j
3 thomas 7 r}
for k, v in d.items():
    print (v.describe())
name number description
count 4 4 4
unique 4 4 4
top john 5 b
freq 1 1 1
name number description
count 4 4 4
unique 4 4 4
top bruno 5 g
freq 1 1 1
name number description
count 4 4 4
unique 4 4 4
top harry 5 r
freq 1 1 1
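If you control how the frames are created, the cleanest fix is to never create 20-30 numbered variables at all and build the dict directly; a minimal sketch with shortened sample data (the names and data here are illustrative, not from the question):

```python
import pandas as pd

# build the frames directly into a dict instead of separate variables
raw = {
    'df_E1': {'name': ['john', 'tom'], 'number': ['3', '4']},
    'df_E2': {'name': ['andy', 'candy'], 'number': ['1', '2']},
}
d = {name: pd.DataFrame(data) for name, data in raw.items()}

# now any per-frame operation is a plain loop
for name, frame in d.items():
    print(name)
    print(frame.describe())
```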
It is possible, though not recommended, with globals():
for i in range(1, 4):
    print(globals()['df_E' + str(i)])
name number description
0 john 3 a
1 tom 4 b
2 barney 5 c
3 freddie 6 d
name number description
0 andy 1 g
1 candy 2 h
2 bruno 5 j
3 mars 8 k
name number description
0 donald 9 c
1 trump 4 g
2 harry 5 j
3 thomas 7 r
You can also do this with the eval() function, but it is frowned upon:
for i in range(1, 4):
    print(eval('df_E{}.describe()'.format(i)))
Here is a link explaining why it's considered bad practice.
I have a table with two ID columns, and I want to create a new ID that groups rows where these overlap.
The point of this is to understand the level at which you can sum the unique values linked to each id so that one total can be divided by the other, with all values covered and no double counting.
For example, given a table like this:
ID 1  ID 2
1     1
1     2
2     3
3     4
3     5
4     5
I want to create a new id column like this:
ID 1  ID 2  ID 3
1     1     1
1     2     1
2     3     2
3     4     3
3     5     3
4     5     3
I am very new to pandas and not sure where to begin.
Thanks for any help, and hopefully that is clear :)
This is inherently a graph problem; you can solve it robustly with networkx:
import networkx as nx
# make ids unique (ID1/1 ≠ ID2/1)
id1 = df['ID 1'].astype(str).radd('ID1_')
id2 = df['ID 2'].astype(str).radd('ID2_')
# make graph
G = nx.from_edgelist(zip(id1, id2))
# get subgraphs
new_ids = {k: i for i, s in enumerate(nx.connected_components(G), start=1)
for k in s}
df['ID 3'] = id1.map(new_ids)
Output:
ID 1 ID 2 ID 3
0 1 1 1
1 1 2 1
2 2 3 2
3 3 4 3
4 3 5 3
5 4 5 3
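If pulling in networkx is not an option, the same connected-components grouping can be done with a small hand-rolled union-find; a self-contained sketch on the question's sample data (column names assumed as above):

```python
import pandas as pd

df = pd.DataFrame({'ID 1': [1, 1, 2, 3, 3, 4],
                   'ID 2': [1, 2, 3, 4, 5, 5]})

parent = {}

def find(x):
    # find the root of x's component, with path halving
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# prefix the ids so ID1/1 and ID2/1 stay distinct nodes
left = 'ID1_' + df['ID 1'].astype(str)
right = 'ID2_' + df['ID 2'].astype(str)
for a, b in zip(left, right):
    union(a, b)

# number the components in order of first appearance
df['ID 3'] = pd.factorize(left.map(find))[0] + 1
print(df)
```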
My current DF looks like below
x y z x c name status
1 2 3 2 5 Jon Work
1 2 5 4 5 Adam Work
9 7 3 9 5 Adam Holiday
3 2 3 4 5 Anna Work
1 4 6 8 5 Anna Work
4 1 6 8 5 Kate Off
2 1 6 1 5 Jon Off
My lists with the specific values look like this:
name = ['Jon', 'Adam']
status = ['Off', 'Work']
Using those lists, I need to create a new dataframe for each unique element in the status list. It should look like this:
df_off:
x y z x c name status
2 1 6 1 5 Jon Off
There is only one row, because the name Kate is not in the name list.
df_Work:
x y z x c name status
1 2 3 2 5 Jon Work
1 2 5 4 5 Adam Work
In the second DF there is no "Anna" because she is not in the name list.
I hope that is clear. Do you have any idea how I can solve this?
Regards,
Tomasz
First, filter your data:
name = ['Jon', 'Adam']
status = ['Off', 'Work']
df[df['name'].isin(name) & df['status'].isin(status)]
Then use groupby and collect the groups into a dictionary:
conditions = df['name'].isin(name)&df['status'].isin(status)
dfs = {'df_%s' % k:v for k,v in df[conditions].groupby('status')}
Then access your dataframes using:
>>> dfs['df_Work']
x y z x.1 c name status
0 1 2 3 2 5 Jon Work
1 1 2 5 4 5 Adam Work
You can even use multiple groups:
dfs = {'df_%s_%s' % k:v for k,v in df.groupby(['name', 'status'])}
dfs['df_Adam_Work']
If your goal is to save the subframes:
for groupname, group in df[conditions].groupby('status'):
    group.to_excel(f'df_{groupname}.xlsx')
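Putting it together, a runnable sketch on a simplified version of the question's data (the y, z, and duplicate x columns are dropped here for brevity):

```python
import pandas as pd

# simplified sample data from the question
df = pd.DataFrame({
    'x': [1, 1, 9, 3, 1, 4, 2],
    'c': [5, 5, 5, 5, 5, 5, 5],
    'name': ['Jon', 'Adam', 'Adam', 'Anna', 'Anna', 'Kate', 'Jon'],
    'status': ['Work', 'Work', 'Holiday', 'Work', 'Work', 'Off', 'Off'],
})

name = ['Jon', 'Adam']
status = ['Off', 'Work']

# keep only rows matching both lists, then split by status
conditions = df['name'].isin(name) & df['status'].isin(status)
dfs = {'df_%s' % k: v for k, v in df[conditions].groupby('status')}

print(dfs['df_Off'])
print(dfs['df_Work'])
```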
I have a dataframe with patients, dates, medications, and diagnoses.
Each patient has a unique id ('pid') and may or may not be treated with different drugs.
What is the best practice for selecting all patients that have at some point been treated with a certain drug?
Since my dataset is huge, for loops and if statements are a last resort.
Example:
IN:
pid drug
1 A
1 B
1 C
2 A
2 C
2 E
3 B
3 C
3 D
4 D
4 E
4 F
Select all patients who have at some point been treated with drug 'B'. Note that all entries of those patients must be included, not just the treatments with drug B:
OUT:
1 A
1 B
1 C
3 B
3 C
3 D
My current solution:
1) Get all pids for rows that include drug 'B'.
2) Get all rows that include a pid from step 1.
The problem with this solution is that I would need an extremely long if statement listing all the pids (millions).
I support COLDSPEED's answer, but your two-step plan (get all pids for rows with drug 'B', then get all rows with those pids) can be done far more simply than hardcoding the ifs:
patients_B = df.loc[df['drug'] == 'B', 'pid']
or
patients_B = set(df.loc[df['drug'] == 'B', 'pid'])
and then
result = df[df['pid'].isin(patients_B)]
The easiest method involves groupby + transform:
df[df.drug.eq('B').groupby(df.pid).transform('any')]
pid drug
0 1 A
1 1 B
2 1 C
6 3 B
7 3 C
8 3 D
In pursuit of a faster solution, call groupby on df, not on a Series:
df[df.groupby('pid').drug.transform(lambda x: x.eq('B').any())]
pid drug
0 1 A
1 1 B
2 1 C
6 3 B
7 3 C
8 3 D
(Negating the mask with ~ gives the complement: all rows for patients never treated with 'B'.)
Here is one way.
s = df.groupby('drug')['pid'].apply(set)
result = df[df['pid'].isin(s['B'])]
# pid drug
# 0 1 A
# 1 1 B
# 2 1 C
# 6 3 B
# 7 3 C
# 8 3 D
Explanation
Create a mapping series s as a separate initial step so that it
does not need recalculating for each result.
For the comparisons, use set for O(1) complexity lookup.
If I understand correctly, use filter:
df.groupby('pid').filter(lambda x : (x['drug']=='B').any())
Out[18]:
pid drug
0 1 A
1 1 B
2 1 C
6 3 B
7 3 C
8 3 D
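For reference, the approaches in this thread agree on the sample data; a quick self-contained check (column names taken from the question):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({'pid': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
                   'drug': list('ABCACEBCDDEF')})

# two-step isin: collect the pids treated with 'B', then keep their rows
patients_B = set(df.loc[df['drug'] == 'B', 'pid'])
r1 = df[df['pid'].isin(patients_B)]

# groupby + transform on a boolean mask
r2 = df[df.drug.eq('B').groupby(df.pid).transform('any')]

# groupby + filter
r3 = df.groupby('pid').filter(lambda x: (x['drug'] == 'B').any())
```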
Description
Long story short, I need a way to sort a DataFrame by a specific column under a specific function, analogous to the "key" parameter of Python's built-in sorted() function. Yet there is no such "key" parameter in pd.DataFrame.sort_values().
The approach used for now
I have to create a new column to store the "scores" of each row and delete it at the end. The problem with this approach is the need to generate a column name that does not already exist in the DataFrame, and it becomes even more troublesome when sorting by multiple columns.
I wonder if there is a more suitable way, with no need to invent a new column name, just like passing "key" to sorted().
Update: I changed my implementation to use a new object instead of generating a string outside the existing column names to avoid collision, as shown in the code below.
Code
Here is the example code. In this sample the DataFrame must be sorted by the length of the data in the "snippet" column. Please make no additional assumptions about the type of the objects in each row of that column. The only things given are the column itself and a function object/lambda expression (in this example: len) that takes each object in the column as input and produces a value used for comparison.
def sort_table_by_key(self, ascending=True, key=len):
    """
    Sort the table inplace.
    """
    # column_tmp = "".join(self._table.columns)
    column_tmp = object()  # Create a new object to avoid column name collision.
    # Calculate the scores of the objects.
    self._table[column_tmp] = self._table["snippet"].apply(key)
    self._table.sort_values(by=column_tmp, ascending=ascending, inplace=True)
    del self._table[column_tmp]
At the time this was asked, this was not implemented; check GitHub issue 3942. (pandas 1.1.0 later added a key parameter to sort_values.)
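On a recent pandas (assuming version 1.1.0 or later), the whole workaround collapses to a one-liner, since sort_values now accepts a key callable:

```python
import pandas as pd

df = pd.DataFrame({'snippet': ['abcdef', 'ab', 'abcd']})

# key receives the column as a Series and must return a
# like-shaped Series of sort scores
out = df.sort_values('snippet', key=lambda s: s.map(len))
print(out)
```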
I think you need argsort and then select by iloc:
df = pd.DataFrame({
'A': ['assdsd','sda','affd','asddsd','ffb','sdb','db','cf','d'],
'B': list(range(9))
})
print (df)
A B
0 assdsd 0
1 sda 1
2 affd 2
3 asddsd 3
4 ffb 4
5 sdb 5
6 db 6
7 cf 7
8 d 8
def sort_table_by_length(column, ascending=True):
    if ascending:
        return df.iloc[df[column].str.len().argsort()]
    else:
        return df.iloc[df[column].str.len().argsort()[::-1]]
print (sort_table_by_length('A'))
A B
8 d 8
6 db 6
7 cf 7
1 sda 1
4 ffb 4
5 sdb 5
2 affd 2
0 assdsd 0
3 asddsd 3
print (sort_table_by_length('A', False))
A B
3 asddsd 3
0 assdsd 0
2 affd 2
5 sdb 5
4 ffb 4
1 sda 1
7 cf 7
6 db 6
8 d 8
How it works:
First, get the lengths as a new Series:
print (df['A'].str.len())
0 6
1 3
2 4
3 6
4 3
5 3
6 2
7 2
8 1
Name: A, dtype: int64
Then get the indices of the sorted values with argsort; for descending order, the result is reversed:
print (df['A'].str.len().argsort())
0 8
1 6
2 7
3 1
4 4
5 5
6 2
7 0
8 3
Name: A, dtype: int64
Finally, change the ordering with iloc:
print (df.iloc[df['A'].str.len().argsort()])
A B
8 d 8
6 db 6
7 cf 7
1 sda 1
4 ffb 4
5 sdb 5
2 affd 2
0 assdsd 0
3 asddsd 3
I have a DataFrame of people. One of the columns in this DataFrame is a place_id. I also have a DataFrame of places, where one of the columns is place_id and another is weather. For every person, I am trying to find the corresponding weather. Importantly, many people have the same place_ids.
Currently, my setup is this:
def place_id_to_weather(pid):
    return place_df[place_df['place_id'] == pid]['weather'].item()

person_df['weather'] = person_df['place_id'].map(place_id_to_weather)
But this is untenably slow. I would like to speed this up. I suspect that I could achieve a speedup like this:
Instead of returning place_df[...].item(), which does a search for place_id == pid for that entire column and returns a series, and then grabbing the first item in that series, I really just want to curtail the search in place_df after the first match place_df['place_id']==pid has been found. After that, I don't need to search any further. How do I limit the search to first occurrences only?
Are there other methods I could use to achieve a speedup here? Some kind of join-type method?
I think you need drop_duplicates with merge. If place_id and weather are the only common columns in both DataFrames, you can omit the on parameter (depending on the data, on='place_id' may be necessary):
df1 = place_df.drop_duplicates(['place_id'])
print (df1)
print (pd.merge(person_df, df1))
Sample data:
person_df = pd.DataFrame({'place_id':['s','d','f','s','d','f'],
'A':[4,5,6,7,8,9]})
print (person_df)
A place_id
0 4 s
1 5 d
2 6 f
3 7 s
4 8 d
5 9 f
place_df = pd.DataFrame({'place_id':['s','d','f', 's','d','f'],
'weather':['y','e','r', 'h','u','i']})
print (place_df)
place_id weather
0 s y
1 d e
2 f r
3 s h
4 d u
5 f i
def place_id_to_weather(pid):
    # for the first occurrence, add iloc[0]
    return place_df[place_df['place_id'] == pid]['weather'].iloc[0]

person_df['weather'] = person_df['place_id'].map(place_id_to_weather)
print (person_df)
A place_id weather
0 4 s y
1 5 d e
2 6 f r
3 7 s y
4 8 d e
5 9 f r
#keep='first' is by default, so can be omit
print (place_df.drop_duplicates(['place_id']))
place_id weather
0 s y
1 d e
2 f r
print (pd.merge(person_df, place_df.drop_duplicates(['place_id'])))
A place_id weather
0 4 s y
1 7 s y
2 5 d e
3 8 d e
4 6 f r
5 9 f r
The map function is your quickest method; its purpose is to avoid running a function against the entire dataframe repeatedly, which is what your function ends up doing. Calling the whole dataframe once is fine, but not repeatedly. Tweaking your code just a little will significantly speed up the process and call the place_df dataframe only once:
person_df['weather'] = person_df['place_id'].map(dict(zip(place_df.place_id, place_df.weather)))
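A self-contained sketch of that one-liner on small sample data (the frame contents here are illustrative, not from the question):

```python
import pandas as pd

person_df = pd.DataFrame({'place_id': ['s', 'd', 'f', 's'],
                          'A': [4, 5, 6, 7]})
place_df = pd.DataFrame({'place_id': ['s', 'd', 'f'],
                         'weather': ['y', 'e', 'r']})

# build the lookup dict once, then map it over the person column
lookup = dict(zip(place_df['place_id'], place_df['weather']))
person_df['weather'] = person_df['place_id'].map(lookup)
print(person_df)
```

Note one design difference: with duplicate place_ids, dict(zip(...)) keeps the last weather value for each id, whereas the .iloc[0] approach above keeps the first.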
You can use merge to do the operation:
people = pd.DataFrame([['bob', 1], ['alice', 2], ['john', 3], ['paul', 2]], columns=['name', 'place'])
# name place
#0 bob 1
#1 alice 2
#2 john 3
#3 paul 2
weather = pd.DataFrame([[1, 'sun'], [2, 'rain'], [3, 'snow'], [1, 'rain']], columns=['place', 'weather'])
# place weather
#0 1 sun
#1 2 rain
#2 3 snow
#3 1 rain
pd.merge(people, weather, on='place')
# name place weather
#0 bob 1 sun
#1 bob 1 rain
#2 alice 2 rain
#3 paul 2 rain
#4 john 3 snow
In case you have several weather values for the same place, you may want to use drop_duplicates; then you get the following result:
pd.merge(people, weather, on='place').drop_duplicates(subset=['name', 'place'])
# name place weather
#0 bob 1 sun
#2 alice 2 rain
#3 paul 2 rain
#4 john 3 snow