Iterate through two variables in Pandas Dataframe - python

Suppose I have the following dataframe:
CategoryID  Days  Views
a           1     19
a           2     2000
a           5     5667
a           7     7899
b           1     2
b           3     245
c           1     1
c           2     252
c           7     2657
Given a threshold n, I want to build two lists per category, appending values until Days reaches the threshold, plus one extra element.
So, with n = 4 (i.e. keeping rows where Days < 4, plus one more), I expect for category a:
days_list = [1,2,5]
views_list = [19, 2000, 5667]
After that, I want to apply a function to those lists and then start the iteration on the next category. However, I'm facing two issues with the following code:
I can't iterate properly when i == 0.
The iteration does not move on to the next category.
df['interpolated'] = int
days_list = []
views_list = []
for i, post in enumerate(category):
    if df['category_id'].iloc[i-1] != post:
        days_list.append(df['days new'].iloc[i])
        views_list.append(df['views'].iloc[i])
    elif df['category_id'].iloc[i] == post and df['category_id'].iloc[i-1] == post:
        if df['days new'].iloc[i] < 3:
            days_list.append(df['days new'].iloc[i])
            views_list.append(df['views'].iloc[i])
        elif df['days new'].iloc[i] != 3:
            days_list.append(df['days new'].iloc[i])
            views_list.append(df['views'].iloc[i])
            break
    # Calculate the interpolation
    interpolator = log_interp1d(days_list, views_list)
    df['interpolated'] = round(interpolator(4).astype(int))
    # Reset the lists after the category loop
    days_list = []
    views_list = []
Can someone shed some light on this? Thanks!

You can use a row_number-type operation:
df['row_number'] = df.groupby('CategoryID').cumcount() + 1
Then you will have a dataframe:
CategoryID  Days  Views  row_number
a           1     19     1
a           2     2000   2
a           5     5667   3
a           7     7899   4
b           1     2      1
b           3     245    2
c           1     1      1
c           2     252    2
c           7     2657   3
Then, you should be able to use boolean filtering to get what you want. So for your example,
df_category_a_filtered_4 = df[(df['row_number'] <= 3) & (df['CategoryID'] == 'a')]
This filters your dataframe so that the two lists you want are the two columns. This can obviously be wrapped in a function to do whatever you need.
If you want a more specific output, please specify what that would look like.
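For completeness, here is a minimal sketch of the whole per-category flow under the stated assumptions (threshold n = 4, data as above); log_interp1d is the asker's own helper and is assumed to be defined elsewhere:
import pandas as pd

df = pd.DataFrame({
    "CategoryID": ["a", "a", "a", "a", "b", "b", "c", "c", "c"],
    "Days":       [1, 2, 5, 7, 1, 3, 1, 2, 7],
    "Views":      [19, 2000, 5667, 7899, 2, 245, 1, 252, 2657],
})

n = 4  # threshold
for cat, group in df.groupby("CategoryID"):
    below = group[group["Days"] < n]           # rows under the threshold
    extra = group[group["Days"] >= n].head(1)  # plus one element past it
    subset = pd.concat([below, extra])
    days_list = subset["Days"].tolist()
    views_list = subset["Views"].tolist()
    # apply your function here, e.g.:
    # interpolator = log_interp1d(days_list, views_list)
    print(cat, days_list, views_list)
For category a this prints days [1, 2, 5] and views [19, 2000, 5667], matching the expected output.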

Related

Index and save last N points from a list that meets conditions from dataframe Python

I have a DataFrame that contains gas concentrations and the corresponding valve number. This data was taken continuously where we switched the valves back and forth (valves=1 or 2) for a certain amount of time to get 10 cycles for each valve value (20 cycles total). A snippet of the data looks like this (I have 2,000+ points and each valve stayed on for about 90 seconds each cycle):
gas1      valveW  time
246.9438  2       1
247.5367  2       2
246.7167  2       3
246.6770  2       4
245.9197  1       5
245.9518  1       6
246.9207  1       7
246.1517  1       8
246.9015  1       9
246.3712  2       10
247.0826  2       11
...       ...     ...
My goal is to save the last N points of each valve's cycle. For example, the first cycle where valve=1, I want to index and save the last N points from the end before the valve switches to 2. I would then save the last N points and average them to find one value to represent that first cycle. Then I want to repeat this step for the second cycle when valve=1 again.
I am currently converting from Matlab to Python so here is the Matlab code that I am trying to translate:
% NOAA high
n2o_noaaHigh = [];
co2_noaaHigh = [];
co_noaaHigh = [];
h2o_noaaHigh = [];
ind_noaaHigh_end = zeros(1,length(t_c));
numPoints = 40;
for i = 1:length(valveW_c)-1
    if (valveW_c(i) == 1 && valveW_c(i+1) ~= 1)
        test = (i-numPoints):i;
        ind_noaaHigh_end(test) = 1;
        n2o_noaaHigh = [n2o_noaaHigh mean(n2o_c(test))];
        co2_noaaHigh = [co2_noaaHigh mean(co2_c(test))];
        co_noaaHigh = [co_noaaHigh mean(co_c(test))];
        h2o_noaaHigh = [h2o_noaaHigh mean(h2o_c(test))];
    end
end
ind_noaaHigh_end = logical(ind_noaaHigh_end);
This is what I have so far for Python:
# NOAA high
n2o_noaaHigh = []
co2_noaaHigh = []
co_noaaHigh = []
h2o_noaaHigh = []
t_c_High = []  # time
for i in range(len(valveW_c)):
    # NOAA HIGH
    if (valveW_c[i] == 1):
        t_c_High.append(t_c[i])
        n2o_noaaHigh.append(n2o_c[i])
        co2_noaaHigh.append(co2_c[i])
        co_noaaHigh.append(co_c[i])
        h2o_noaaHigh.append(h2o_c[i])
Thanks in advance!
I'm not sure if I understood correctly, but I guess this is what you are looking for:
# First we create a column to show cycles:
df['cycle'] = (df.valveW.diff() != 0).cumsum()
print(df)
gas1 valveW time cycle
0 246.9438 2 1 1
1 247.5367 2 2 1
2 246.7167 2 3 1
3 246.677 2 4 1
4 245.9197 1 5 2
5 245.9518 1 6 2
6 246.9207 1 7 2
7 246.1517 1 8 2
8 246.9015 1 9 2
9 246.3712 2 10 3
10 247.0826 2 11 3
Now you can use the groupby method to get the average of the last n points of each cycle:
n = 3  # we assume this is n
df.groupby('cycle').apply(lambda x: x.iloc[-n:, 0].mean())
Output:
cycle
1    246.9768
2    246.6579
3    246.7269
Let's call your DataFrame df; then you could do:
results = {}
for k, v in df.groupby((df['valveW'].shift() != df['valveW']).cumsum()):
    results[k] = v
    print(f'[group {k}]')
    print(v)
shift(), as its name suggests, shifts the valve column; comparing the shifted column with the original detects changes in the number sequence. cumsum() then gives a unique number to each group of identical consecutive values, and we can groupby() on that column (which was not possible before, because the raw groups would just be all the ones and all the twos!).
which gives e.g. for your code snippet (saved in results):
[group 1]
gas1 valveW time
0 246.9438 2 1
1 247.5367 2 2
2 246.7167 2 3
3 246.6770 2 4
[group 2]
gas1 valveW time
4 245.9197 1 5
5 245.9518 1 6
6 246.9207 1 7
7 246.1517 1 8
8 246.9015 1 9
[group 3]
gas1 valveW time
9 246.3712 2 10
10 247.0826 2 11
Then, to get the mean for each cycle, you could e.g. do:
df.groupby((df['valveW'].shift() != df['valveW']).cumsum()).mean()
which gives (again for your code snippet):
            gas1  valveW  time
valveW
1      246.96855     2.0   2.5
2      246.36908     1.0   7.0
3      246.72690     2.0  10.5
where you wouldn't care much about the time mean, but the gas1 one is what you want!
Then, based on results you could e.g. do:
n = 3
mean_n_last = []
for k, v in results.items():
    if len(v) < n:
        mean_n_last.append(np.nan)
    else:
        mean_n_last.append(np.nanmean(v.iloc[len(v) - n:, 0]))
which gives [246.9768, 246.65796666666665, nan] for n = 3 (the third cycle has fewer than n points, hence the nan)!
If your dataframe is sorted by time, you could get the last N records for each valve like this.
N=2
valve1 = df[df['valveW']==1].iloc[-N:,:]
valve2 = df[df['valveW']==2].iloc[-N:,:]
If it isn't currently sorted, you could easily sort it like this (note that sort_values returns a new frame unless you reassign it):
df = df.sort_values(by=['time'])
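Combining the answers above, here is a minimal end-to-end sketch (using the snippet's data) that labels cycles and averages the last n points of each cycle, rather than the last N rows per valve overall:
import pandas as pd

df = pd.DataFrame({
    "gas1": [246.9438, 247.5367, 246.7167, 246.6770, 245.9197,
             245.9518, 246.9207, 246.1517, 246.9015, 246.3712, 247.0826],
    "valveW": [2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2],
    "time": range(1, 12),
})

n = 3
df["cycle"] = (df["valveW"].diff() != 0).cumsum()  # label each run of equal valve values
last_n = df.groupby("cycle").tail(n)               # last n rows of every cycle
cycle_means = last_n.groupby("cycle")["gas1"].mean()
print(cycle_means)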

Identify parent of hierarchical data in a dataframe given ordered index and depth only

Before I begin: I can hack something together to do this on a small scale, but my goal is to apply it to a 200k+ row dataset, so efficiency is the priority and I lack the more... nuanced techniques. :-)
So, I have an ordered data set that represents data from a very complex hierarchical structure. I only have a unique ID, the tree depth, and the fact that it is in order. For example:
a
    b
        c
            d
            e
        f
        g
            h
i
    j
        k
    l
Which is stored as:
   ID  depth
0   a      0
1   b      1
2   c      2
3   d      3
4   e      3
5   f      2
6   g      2
7   h      3
8   i      0
9   j      1
10  k      2
11  l      1
Here's a line that should generate my example.
df = pd.DataFrame.from_dict({ "ID":["a","b","c","d","e","f","g","h","i","j","k","l"],
                              "depth":[0,1,2,3,3,2,2,3,0,1,2,1] })
What I want is to return either the index of each elements' nearest parent node or the parents' unique ID (they'll both work since they're both unique). Something like:
   ID  depth  parent  p.idx
0   a      0
1   b      1       a      0
2   c      2       b      1
3   d      3       c      2
4   e      3       c      2
5   f      2       b      1
6   g      2       b      1
7   h      3       g      6
8   i      0
9   j      1       i      8
10  k      2       j      9
11  l      1       i      8
My initial sloppy solution involved adding a column that was index-1, then self matching the data set with idx-1 (left) and idx (right), then identifying the maximum parent idx less than the child index... it didn't scale up well.
Here are a couple of routes to performing this task I've put together that work but aren't very efficient.
The first uses simple loops and includes a break to exit when the first match is identified.
df = pd.DataFrame.from_dict({ "ID":["a","b","c","d","e","f","g","h","i","j","k","l"],
                              "depth":[0,1,2,3,3,2,2,3,0,1,2,1] })
df["parent"] = ""
# loop over entire dataframe
for i1 in range(len(df.depth)):
    # loop back up from current row to top
    for i2 in range(i1):
        # identify row where the depth is 1 less
        if df.depth[i1] - 1 == df.depth[i1-i2-1]:
            # Set parent value and exit loop
            df.parent[i1] = df.ID[i1-i2-1]
            break
df.head(15)
This second merges the dataframe with itself and then uses a groupby to identify the maximum parent row less than each original row:
df = pd.DataFrame.from_dict({ "ID":["a","b","c","d","e","f","g","h","i","j","k","l"],
                              "depth":[0,1,2,3,3,2,2,3,0,1,2,1] })
# Columns for comparison and merging
df["parent_depth"] = df.depth - 1
df["row"] = df.index
# Merge to return ALL elements matching the parent depth of each row
df = df.merge(df[["ID","depth","row"]], left_on="parent_depth", right_on="depth", how="left", suffixes=('','_y'))
# Identify the maximum parent row less than the original row
g1 = df[(df.row_y < df.row) | (df.row_y.isnull())].groupby("ID").max()
g1.reset_index(inplace=True)
# clean up
g1.drop(["parent_depth","row","depth_y","row_y"], axis=1, inplace=True)
g1.rename(columns={"ID_y":"parent"}, inplace=True)
g1.head(15)
I'm confident those with more experience can provide more elegant solutions, but since I got something working, I wanted to provide my "solution". Thanks!
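For anyone after something faster, a single O(n) pass also works here, since a row's parent is simply the most recent earlier row that sits one level shallower. A minimal sketch (not from the original post):
import pandas as pd

df = pd.DataFrame({"ID": list("abcdefghijkl"),
                   "depth": [0, 1, 2, 3, 3, 2, 2, 3, 0, 1, 2, 1]})

# Remember the most recent row index seen at each depth;
# the parent of a row at depth d is the last row seen at depth d - 1.
last_at_depth = {}
parents = []
for idx, d in zip(df.index, df["depth"]):
    parents.append(last_at_depth.get(d - 1))
    last_at_depth[d] = idx

df["p.idx"] = parents
df["parent"] = df["p.idx"].map(df["ID"])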

Pandas: merge dataframes and consolidate multiple joined values into an array

I'm very new to Python and am using Pandas to convert a bunch of MySQL tables to JSON. My current solution works just fine, but (1) it is not very pythonic, and (2) I feel like there must be some pre-baked Pandas function that does what I need...? Any guidance on the following problem would be helpful.
Say I have two data frames, authors and a join table plays_authors that represents a 1:many relationship of authors to plays.
print authors
> author_id dates notes
> 0 1 1700s a
> 1 2 1800s b
> 2 3 1900s c
print plays_authors
> author_id play_id
> 0 1 12
> 1 1 13
> 2 1 21
> 3 2 18
> 4 3 3
> 5 3 7
I want to merge plays_authors onto authors, but instead of having multiple rows per author (1 per play_id), I want one row per author, with an array of play_id values so that I can easily export them as json records.
print authors
> author_id dates notes play_id
> 0 1 1700s a [12, 13, 21]
> 1 2 1800s b [18]
> 2 3 1900s c [3, 7]
authors.to_json(orient="records")
> '[{
> "author_id":"1",
> "dates":"1700s",
> "notes":"a",
> "play_id":["12","13","21"]
> },
> {
> "author_id":"2",
> "dates":"1800s",
> "notes":"b",
> "play_id":["18"]
> },
> {
> "author_id":"3",
> "dates":"1900s",
> "notes":"c",
> "play_id":["3","7"]
> }]'
My current solution:
# main_df: main dataframe to transform
# join_df: the dataframe of the join table w/ values to add to df
# main_index: name of main_df index column
# multi_index: name of column w/ multiple values per main_index, added by merge with join_df
# jointype: type of merge to perform, e.g. left, right, inner, outer
def consolidate(main_df, join_df, main_index, multi_index, jointype):
    # merge
    main_df = pd.merge(main_df, join_df, on=main_index, how=jointype)
    # consolidate
    new_df = pd.DataFrame({})
    for i in main_df[main_index].unique():
        i_rows = main_df.loc[main_df[main_index] == i]
        values = []
        for column in main_df.columns:
            values.append(i_rows[:1][column].values[0])
        row_dict = dict(zip(main_df.columns, values))
        row_dict[multi_index] = list(i_rows[multi_index])
        new_df = new_df.append(row_dict, ignore_index=True)
    return new_df

authors = consolidate(authors, plays_authors, 'author_id', 'play_id', 'left')
Is there a simple groupby / better dict solution out there that's currently just over my head?
Data:
In [131]: a
Out[131]:
author_id dates notes
0 1 1700s a
1 2 1800s b
2 3 1900s c
In [132]: pa
Out[132]:
author_id play_id
0 1 12
1 1 13
2 1 21
3 2 18
4 3 3
5 3 7
Solution:
In [133]: a.merge(pa.groupby('author_id')['play_id'].apply(list).reset_index())
Out[133]:
author_id dates notes play_id
0 1 1700s a [12, 13, 21]
1 2 1800s b [18]
2 3 1900s c [3, 7]
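As a follow-up, chaining to_json onto the merged result produces the JSON records from the question (a small sketch using the same a and pa frames; value types may differ slightly depending on column dtypes):
merged = a.merge(pa.groupby('author_id')['play_id'].apply(list).reset_index())
print(merged.to_json(orient="records"))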

An elegant way to make transformation of something like transpose in pandas faster

I have a pandas.DataFrame called a with columns id, w and t (shown as an image in the original post), and I want to get a DataFrame b (also shown as an image) where b is like a transpose of a: one row per id and one column per value of w.
To convert a to b, I use this code:
id_uni = a['id'].unique()
b = pd.DataFrame(columns=['id'] + [str(i) for i in range(1, 4)])
b['id'] = id_uni
for i in id_uni:
    for j in range(7):
        ind = (a['id'] == i) & (a['w'] == j)
        med = a.loc[ind, 't'].values
        if med:
            b.loc[b['id'] == i, str(j)] = med[0]
        else:
            b.loc[b['id'] == i, str(j)] = 0
This approach is brute force: I just use two for-loops to copy every element from a to b, and it is very slow. Do you have an efficient way to improve it?
You can use pivot:
print (df.pivot(index='id', columns='w', values='t'))
w 1 2 3
id
0 54 147 12
1 1 0 1
df1 = df.pivot(index='id', columns='w', values='t').reset_index()
df1.columns.name=None
print (df1)
id 1 2 3
0 0 54 147 12
1 1 1 0 1
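One caveat, not in the original answer: the asker's loop writes 0 where no (id, w) pair exists, while pivot leaves such cells as NaN; a fillna matches that behavior:
df1 = df.pivot(index='id', columns='w', values='t').fillna(0).reset_index()
df1.columns.name = None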

Conditional length of a binary data series in Pandas

Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination [1] to indicate where consecutive runs are shorter than 2 (or whatever limiting value you like). Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
          .astype(int)  # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
[1] See the pandas cookbook, the section on grouping: "Grouping like Python's itertools.groupby".
Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = list(map(lambda x: df['A'][x] if x < limit else int(not all(y == 1 for y in df['A'][x - limit:x])), range(len(df))))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array):
import numpy
a = df['A'].to_numpy(copy=True)  # as_matrix() is removed in modern pandas
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
    # zero out every element that is part of a run of 1s longer than `cutoff`
    a = numpy.asarray(array)
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
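A quick hypothetical usage check (note that trim_runs mutates the array it is given, hence the copy):
import numpy
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
df['B'] = trim_runs(df['A'].to_numpy(copy=True), 2)
print(df)  # B == [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]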
