My apologies if this is a duplicate but I couldn't find anything exactly like this myself.
I have two Astropy tables, let's say X and Y. Each has multiple columns but what I want to do is to compare them by setting various conditions on different columns.
For example, table X looks like this and has 1000 rows and 9 columns (let's say):
Name_X (str)    Date_X (float64)     Date (int32)  ...
GaiaX21-116383  59458.633888888886   59458         ...
GaiaX21-116382  59458.504375         59458         ...
and table Y looks like this and has 500 rows and 29 columns (let's say):
Name_Y (str14)  Date_Y (float64)     Date (int32)  ...
GaiaX21-117313  59461.911724537036   59461         ...
GaiaX21-118760  59466.905173611114   59466         ...
I want to compare the two tables: basically, check if the same 'Name' exists in both tables. If it does, I treat that as a "match", take the entire row, put it in a new table, and discard everything else (or store it in another temporary table).
So I wrote a function like this:
def find_diff(table1, table2, param):
    # table1 is the bigger table; param defines which column to compare,
    # assuming both tables use the same column names
    temp = Table(table1[0:0])
    table3 = Table(table1[0:0])
    for i in range(0, len(table1)):
        for j in range(0, len(table2)):
            if table1[param][i] != table2[param][j]:
                # temp.add_row(table2[j])
                # else:
                table3.add_row(table1[i])
    return table3
While this works in principle, it takes a huge amount of time to finish, so it simply isn't practical to run the code this way. Similarly, I want to apply other conditions to other columns (cross-matching the observation dates, for example).
Any suggestions would be greatly helpful, thank you!
It sounds like you want to do a table join on the name columns. This can be done as documented at https://docs.astropy.org/en/stable/table/operations.html#join.
E.g.
# Assume table_x and table_y
from astropy.table import join
table_xy = join(table_x, table_y, keys_left='Name_X', keys_right='Name_Y')
As a full example with non-unique key values:
In [10]: t1 = Table([['x', 'x', 'y', 'z'], [1,2,3,4]], names=['a', 'b'])
In [11]: t2 = Table([['x', 'y', 'y', 'Q'], [10,20,30,40]], names=['a', 'c'])
In [12]: table.join(t1, t2, keys='a')
Out[12]:
<Table length=4>
 a     b     c
str1 int64 int64
---- ----- -----
   x     1    10
   x     2    10
   y     3    20
   y     3    30
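For the date cross-matching mentioned in the question, keys_left and keys_right also accept lists of column names, so a rough sketch (assuming the column names shown in the question, with Date being the shared integer column) could match on both the object name and the observation date:
from astropy.table import join
# Keep only rows whose name and integer date appear in both tables.
table_xy = join(table_x, table_y,
                keys_left=['Name_X', 'Date'],
                keys_right=['Name_Y', 'Date'],
                join_type='inner')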
I believe this site would be your best friend for this problem: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
So in theory I believe you would want something like this:
result = pd.merge(table_y, table_x, on="Name")
The key difference here is that you might need to adjust the column names so that both tables share the same "Name" column. This will match rows on the "Name" column between the two tables, and the rows where the names agree end up in the result variable. From there you can do whatever you would like with the dataframe.
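Since the inputs here are Astropy tables, one possible sketch (using the column names from the question rather than a shared "Name" column) would convert them to pandas first and merge with left_on/right_on, which avoids renaming:
import pandas as pd
# table_x and table_y are assumed to be the Astropy tables from the question.
df_x = table_x.to_pandas()
df_y = table_y.to_pandas()
# An inner merge keeps only the rows whose names appear in both tables.
result = pd.merge(df_y, df_x, left_on='Name_Y', right_on='Name_X', how='inner')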
Related
I have a large pandas dataframe with many different types of observations that need different models applied to them. One column indicates which model to apply, and it can be mapped to a Python function which accepts a dataframe and returns a dataframe. One approach would be just doing 3 steps:
split dataframe into n dataframes for n different models
run each dataframe through each function
concatenate output dataframes at the end
This just ends up not being super flexible, particularly as models are added and removed. Looking at groupby, it seems like I should be able to leverage it to make this look much cleaner code-wise, but I haven't been able to find a pattern that does what I'd like.
Also because of the size of this data, using apply isn't particularly useful as it would drastically slow down the runtime.
Quick example:
df = pd.DataFrame({"model":["a","b","a"],"a":[1,5,8],"b":[1,4,6]})
def model_a(df):
return df["a"] + df["b"]
def model_b(df):
return df["a"] - df["b"]
model_map = {"a":model_a,"b":model_b}
results = df.groupby("model")...
The expected result would look like [2,1,14]. Is there an easy way code-wise to do this? Note that the actual models are much more complicated and involve potentially hundreds of variables with lots of transformations, this is just a toy example.
Thanks!
You can use groupby/apply:
x.name contains the name of the group, here a and b
x contains the sub dataframe
df['r'] = df.groupby('model') \
.apply(lambda x: model_map[x.name](x)) \
.droplevel(level='model')
>>> df
  model  a  b   r
0     a  1  1   2
1     b  5  4   1
2     a  8  6  14
Or you can use np.select:
>>> np.select([df['model'] == 'a', df['model'] == 'b'],
[model_a(df), model_b(df)])
array([ 2, 1, 14])
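If the set of models grows, one possible generalization (a sketch based on the toy example above, assuming every model function is vectorized over the whole dataframe) is to build the conditions and choices directly from model_map, so adding a model only requires adding it to the dictionary:
import numpy as np
import pandas as pd

df = pd.DataFrame({"model": ["a", "b", "a"], "a": [1, 5, 8], "b": [1, 4, 6]})

def model_a(d):
    return d["a"] + d["b"]

def model_b(d):
    return d["a"] - d["b"]

model_map = {"a": model_a, "b": model_b}

# One condition/choice pair per entry in model_map.
conditions = [df["model"] == name for name in model_map]
choices = [func(df) for func in model_map.values()]
df["r"] = np.select(conditions, choices)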
I have a Pandas DataFrame like:
   COURSE BIB#  COURSE 1  COURSE 2  STRAIGHT-GLIDING     MEAN  PRESTASJON
1            2    20.220    22.535             19.91  21.3775    1.073707
0            1    21.235    23.345             20.69  22.2900    1.077332
This is from a pilot and the DataFrame may be much longer when we perform the real experiment. Now that I have calculated the performance for each BIB#, I want to allocate them into two different groups based on their performance. I have therefore written the following code:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
This sorts values in the DataFrame. Now I want to assign even rows to one group and odd rows to another. How can I do this?
I am not sure what exactly I am looking for. I have looked through the documentation for the random module in Python, but that is not exactly what I need. I have seen some questions/posts pointing to a scikit-learn stratification function, but I don't know if that is a good choice. Alternatively, is there a way to create a loop that accomplishes this? I appreciate your help.
Here is a figure to illustrate what I want to accomplish.
How about this:
threshold = 0.5
df1['group'] = df1['PRESTASJON'] > threshold
Or if you want values for your groups:
df['group'] = np.where(df['PRESTASJON'] > threshold, 'A', 'B')
Here, 'A' will be assigned to column 'group' if PRESTASJON exceeds the threshold, otherwise 'B'.
UPDATE: Per OP's update on the post, if you want to group them alternatively into two groups:
# sort your dataframe based on the PRESTASJON column
df1 = df1.sort_values(by='PRESTASJON')
# create a new column with default value 'A' and assign alternating (odd) rows to 'B'
df1['group'] = 'A'
df1.iloc[1::2, -1] = 'B'
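Equivalently, a small sketch (not part of the original answer) that builds the alternating labels in one step with np.where:
import numpy as np
df1 = df1.sort_values(by='PRESTASJON')
# Even positions get 'A', odd positions get 'B'.
df1['group'] = np.where(np.arange(len(df1)) % 2, 'B', 'A')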
Are you trying to split the dataframe into alternating rows? If so, you can do:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
for i,d in df1.groupby(np.arange(len(df1)) %2):
print(f'group {i}')
print(d)
Another way without groupby:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
mask = np.arange(len(df1)) %2
group1 = df1.loc[mask==0]
group2 = df1.loc[mask==1]
So I struggled to even come up with a title for this question. Not sure I can edit the question title, but I would be happy to do so once there is clarity.
I have a data set from an experiment where each row is a point in time for a specific group. [Edited based on better approach to generate data by Daniela Vera below]
df = pd.DataFrame({'x1': np.random.randn(30),'time': [1,2,3,4,5,6,7,8,9,10] * 3,'grp': ['c', 'c', 'c','a','a','b','b','c','c','c'] * 3})
df.head(10)
x1 time grp
0 0.533131 1 c
1 1.486672 2 c
2 1.560158 3 c
3 -1.076457 4 a
4 -1.835047 5 a
5 -0.374595 6 b
6 -1.301875 7 b
7 -0.533907 8 c
8 0.052951 9 c
9 -0.257982 10 c
10 -0.442044 1 c
In the dataset, some groups only start to have values after time 5 (in this case, group b). However, in the dataset I am working with there are up to 5,000 groups rather than just the 3 groups in this example.
I would like to identify every group that only has values after time 5 and drop it from the overall dataframe.
I have come up with a solution that works, but I feel like it is very clunky, and wondered if there was something cleaner.
# First I split the data into before and after the time of interest
after = df[df['time'] > 5].copy()
before = df[df['time'] < 5].copy()
# Then I merge the two dataframes and use the indicator to find out which groups only appear after time 5
missing = pd.merge(after, before, on='grp', how='outer', indicator=True)
# Then I use groupby and nunique to identify the groups that only appear after time 5 and save the result
something = missing[missing['_merge'] == 'left_only'].groupby('grp').nunique()
# I extract the list of group ids from the result
something = something.index
# I go back to my main dataframe and make the group id the index
df = df.set_index('grp')
# I then apply .drop with the list of group ids and reset the index
df = df.drop(something)
df = df.reset_index()
Like I said, super clunky. But I just couldn't figure out an alternative. Please let me know if anything isn't clear and I'll happily edit with more details.
I am not sure if I get it, but let's say you have this data:
df = pd.DataFrame({'x1': np.random.randn(30),'time': [1,2,3,4,5,6,7,8,9,10] * 3,'grp': ['c', 'c', 'c','a','a','b','b','c','c','c'] * 3})
In this case, group "b" only has data for times 6 and 7, which are above time 5. You can use this process to get a dictionary with the times at which each group has at least one data point, and also a list called "keep" with the groups that have at least one data point before time 5.
list_groups = ["a","b","c"]
times_per_group = {}
keep = []
for group in list_groups:
times_per_group[group] = list(df[df.grp ==group].time.unique())
condition = any([i<5 for i in list(df[df.grp==group].time.unique())])
if condition:
keep.append(group)
Finally, you just keep the groups present in the list "keep":
df = df[df.grp.isin(keep)]
Let me know if I understood your question!
Of course you can simplify the process; the dictionary is just there to check, and you don't actually need the whole code.
If this result is what you're looking for, you can just do:
keep = [group for group in list_groups if any([i<5 for i in list(df[df.grp == group].time.unique())])]
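If the goal is simply to keep the groups that have at least one observation before time 5, a groupby/transform one-liner might also work (a sketch, assuming the example dataframe above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x1': np.random.randn(30),
                   'time': [1,2,3,4,5,6,7,8,9,10] * 3,
                   'grp': ['c','c','c','a','a','b','b','c','c','c'] * 3})

# Keep rows whose group has at least one observation before time 5;
# groups like "b" (times 6 and 7 only) are dropped entirely.
df = df[df.groupby('grp')['time'].transform('min') < 5]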
In a pd.Series with dtype=category I have 253 unique values. Some of these occur quite often and others occur only once or twice. Now I would like to keep only the top 10 of these and replace the rest with np.nan.
I got as far as top = df['cats'].value_counts().head(10) to create the categories I want to keep. But now?
Something along the lines of df['cats'].apply(cat_replace, args=top)?
def cat_replace(c, top):
    if c in top:
        return c
    else:
        return np.nan
This however doesn't look too 'pandas' to me and I have a feeling there is a better way. Any better suggestions?
# Sample data.
df = pd.DataFrame(
{'cats': pd.Categorical(
list('abcdefghij') * 5
+ list('klmnopqrstuvwxyz'))}
)
top_n = 10
top_cats = df['cats'].value_counts().head(top_n).index.tolist()
df.loc[~df['cats'].isin(top_cats), 'cats'] = np.nan
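An equivalent one-liner, for what it's worth, uses Series.where to keep the top categories and set everything else to NaN:
# Values outside the top N become NaN; the kept values are unchanged.
df['cats'] = df['cats'].where(df['cats'].isin(top_cats))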
Cribbing from
How can I keep the rows of a pandas data frame that match a particular condition using value_counts() on multiple columns
You could look at doing something like
top = set(df['cats'].value_counts().head(10).index)
df['cats'] = df['cats'].where(df['cats'].apply(top.__contains__))
I am trying to derive a single row based on original input and then various changes to individual column values at different points in time. I have simplified the list below.
I have read in some data into my dataframe as so:
   A  B  C  D  E
0  h  h  h  h  h
1  x
2     y  1
3        2  3
row 0 - "h" represents my original record.
rows 1 - 3 are changes over time to a specific column
I would like to create a single "result row" that would look something like:
'x', 'y', '2', '3', 'h'
Is there a simple way to do this with Pandas and Python without excessive looping?
You can get it as a list like so:
>>> [df[s][df[s].last_valid_index()] for s in df]
['x', 'y', 2, 3, 'h']
If you need it appended back with a name, then you need to provide it with an index and then append it, like so:
df.append(pd.Series(temp, index=df.columns, name='total'))
# note, this returns a new object
# where 'temp' is the output of the code above
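Note that DataFrame.append was deprecated and later removed in newer pandas versions; a pd.concat sketch of the same step (assuming temp is the list produced above) would be:
import pandas as pd
# Turn the list of last valid values into a one-row frame labelled 'total'
# and stack it under the original dataframe.
total_row = pd.Series(temp, index=df.columns, name='total').to_frame().T
df = pd.concat([df, total_row])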
You can just try
# df = df.replace({'': np.nan})
df.ffill().iloc[[-1], :]
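As a minimal sketch of that approach on the example above (assuming the blank cells are empty strings, which is why the commented replace step matters):
import numpy as np
import pandas as pd

# One possible layout of the example table; the exact placement of the
# intermediate values does not affect the last-valid result.
df = pd.DataFrame({
    'A': ['h', 'x', '', ''],
    'B': ['h', '', 'y', ''],
    'C': ['h', '', '1', '2'],
    'D': ['h', '', '', '3'],
    'E': ['h', '', '', ''],
})

result = df.replace({'': np.nan}).ffill().iloc[-1]
print(result.tolist())  # ['x', 'y', '2', '3', 'h']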