Pandas: merge dataframes and consolidate multiple joined values into an array - python

I'm very new to Python and am using Pandas to convert a bunch of MySQL tables to JSON. My current solution works just fine, but (1) it is not very pythonic, and (2) I feel like there must be some pre-baked Pandas function that does what I need...? Any guidance on the following problem would be helpful.
Say I have two data frames, authors and a join table plays_authors that represents a 1:many relationship of authors to plays.
print authors
> author_id dates notes
> 0 1 1700s a
> 1 2 1800s b
> 2 3 1900s c
print plays_authors
> author_id play_id
> 0 1 12
> 1 1 13
> 2 1 21
> 3 2 18
> 4 3 3
> 5 3 7
I want to merge plays_authors onto authors, but instead of having multiple rows per author (1 per play_id), I want one row per author, with an array of play_id values so that I can easily export them as json records.
print authors
> author_id dates notes play_id
> 0 1 1700s a [12, 13, 21]
> 1 2 1800s b [18]
> 2 3 1900s c [3, 7]
authors.to_json(orient="records")
> '[{
> "author_id":"1",
> "dates":"1700s",
> "notes":"a",
> "play_id":["12","13","21"]
> },
> {
> "author_id":"2",
> "dates":"1800s",
> "notes":"b",
> "play_id":["18"]
> },
> {
> "author_id":"3",
> "dates":"1900s",
> "notes":"c",
> "play_id":["3","7"]
> }]'
My current solution:
# main_df: main dataframe to transform
# join_df: the dataframe of the join table w/ values to add to main_df
# main_index: name of main_df index column
# multi_index: name of column w/ multiple values per main_index, added by merge with join_df
# jointype: type of merge to perform, e.g. left, right, inner, outer
def consolidate(main_df, join_df, main_index, multi_index, jointype):
    # merge
    main_df = pd.merge(main_df, join_df, on=main_index, how=jointype)
    # consolidate
    new_df = pd.DataFrame({})
    for i in main_df[main_index].unique():
        i_rows = main_df.loc[main_df[main_index] == i]
        values = []
        for column in main_df.columns:
            values.append(i_rows[:1][column].values[0])
        row_dict = dict(zip(main_df.columns, values))
        row_dict[multi_index] = list(i_rows[multi_index])
        new_df = new_df.append(row_dict, ignore_index=True)
    return new_df
authors = consolidate(authors, plays_authors, 'author_id', 'play_id', 'left')
Is there a simple groupby / better dict solution out there that's currently just over my head?

Data:
In [131]: a
Out[131]:
author_id dates notes
0 1 1700s a
1 2 1800s b
2 3 1900s c
In [132]: pa
Out[132]:
author_id play_id
0 1 12
1 1 13
2 1 21
3 2 18
4 3 3
5 3 7
Solution:
In [133]: a.merge(pa.groupby('author_id')['play_id'].apply(list).reset_index())
Out[133]:
author_id dates notes play_id
0 1 1700s a [12, 13, 21]
1 2 1800s b [18]
2 3 1900s c [3, 7]
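If the end goal is the JSON records from the question, the same one-liner feeds straight into to_json. Below is a minimal sketch that rebuilds the frames from the question; whether the ids serialize as numbers or strings depends on your dtypes.

import pandas as pd

authors = pd.DataFrame({'author_id': [1, 2, 3],
                        'dates': ['1700s', '1800s', '1900s'],
                        'notes': ['a', 'b', 'c']})
plays_authors = pd.DataFrame({'author_id': [1, 1, 1, 2, 3, 3],
                              'play_id': [12, 13, 21, 18, 3, 7]})

# collapse play_id into a list per author, then merge back onto authors
plays_as_lists = plays_authors.groupby('author_id')['play_id'].apply(list).reset_index()
authors = authors.merge(plays_as_lists, on='author_id', how='left')

print(authors.to_json(orient='records'))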

Related

Iterate through two variables in Pandas Dataframe

Suppose I have the following dataframe:
CategoryID Days Views
a 1 19
a 2 2000
a 5 5667
a 7 7899
b 1 2
b 3 245
c 1 1
c 2 252
c 7 2657
Given a threshold = n, I want to create two lists and append to them until I reach that threshold + 1 element for each category.
So, if n < 4, I expect for category a:
days_list = [1,2,5]
views_list = [19, 2000, 5667]
After that, I want to apply a function to those lists and then start the iteration on the next category. However, I'm facing two issues with the following code:
I can't iterate properly when i == 0
The iteration does not go to the next category.
df['interpolated'] = int
days_list = []
views_list = []
for i, post in enumerate(category):
    if df['category_id'].iloc[i-1] != post:
        days_list.append(df['days new'].iloc[i])
        views_list.append(df['views'].iloc[i])
    elif df['category_id'].iloc[i] == post and df[category_id].iloc[i-1] == post:
        if df['days new'].iloc[i] < 3:
            days_list.append(df['days new'].iloc[i])
            views_list.append(df['views'].iloc[i])
        elif df['days new'].iloc[i] != 3:
            days_list.append(df['days new'].iloc[i])
            views_list.append(df['views'].iloc[i])
            break
    # Calculate the interpolation
    interpolator = log_interp1d(days_list, views_list)
    df['interpolated'] = round(interpolator(4).astype(int))
    # Reset the lists after the category loop
    days_list = []
    views_list = []
Can someone give me some light? Thanks!
You can use a row_number type operation.
....
df['row_number'] = df.groupby(['CategoryID']).cumcount() + 1
Then, you will have a dataframe
CategoryID Days Views row_number
a 1 19 1
a 2 2000 2
a 5 5667 3
a 7 7899 4
b 1 2 1
b 3 245 2
c 1 1 1
c 2 252 2
c 7 2657 3
Then, you should be able to use boolean filtering to get what you want. So for your example,
df_category_a_filtered_4 = df[(df['row_number'] <= 3) & (df['CategoryID'] == 'a')]
This will filter your dataframe so that the two lists you want are the two columns. This can obviously be wrapped in a function to do whatever you need.
If you want a more specific output, please specify what that would look like.
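In the meantime, one way to pull the two lists out per category is to filter on row_number and loop over the groups. This is a hedged sketch: n and the column names follow the sample data above, and the interpolation call is left as a placeholder.

n = 3  # threshold + 1 rows per category, in the question's terms
subset = df[df['row_number'] <= n]
for category_id, group in subset.groupby('CategoryID'):
    days_list = group['Days'].tolist()
    views_list = group['Views'].tolist()
    # apply the interpolation function to days_list / views_list here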

Alternatives to Dataframe.iterrows() or Dataframe.itertuples()?

My understanding of Pandas dataframe vectorization (through Pandas itself or through NumPy) is that it applies a function to an array, similar to .apply() (please correct me if I'm wrong). Suppose I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'color' : ['red','blue','yellow','orange','green',
                              'white','black','brown','orange-red','teal',
                              'beige','mauve','cyan','goldenrod','auburn',
                              'azure','celadon','lavender','oak','chocolate'],
                   'group' : [1,1,1,1,1,
                              1,1,1,1,1,
                              1,2,2,2,2,
                              4,4,5,6,7]})
df = df.set_index('color')
df
For this data, I want to apply a special counter for each unique value in group. Here's my current implementation of it:
df['C'] = 0
for value in set(df['group'].values):
    filtered_df = df[df['group'] == value]
    adj_counter = 0
    initialize_counter = -1
    spacing_counter = 20
    special_counters = [0,1,-1,2,-2,3,-3,4,-4,5,-5,6,-6,7,-7]
    for color, rows in filtered_df.iterrows():
        if len(filtered_df.index) < 7:
            initialize_counter += 1
            df.loc[color,'C'] = (46 + special_counters[initialize_counter])
        else:
            spacing_counter += 1
            if spacing_counter > 5:
                spacing_counter = 0
            df.loc[color,'C'] = spacing_counter
df
Is there a faster way to implement this that doesn't involve iterrows or itertuples? Since the counting in the C column is very irregular, I'm not sure how I could implement this through apply or even through vectorization.
What you can do first is create the column 'C' with a groupby on the column 'group' and cumcount; that almost represents spacing_counter or initialize_counter, depending on whether len(filtered_df.index) < 7 or not.
df['C'] = df.groupby('group').cumcount()
Now you need to select the appropriate rows to do the if or the else part of your code. One way is to create a series, using groupby again with transform, that gives the size of the group each row belongs to. Then use loc on your df with this series: if the value is smaller than 7, map your values with special_counters; otherwise just take them modulo 6.
ser_size = df.groupby('group')['C'].transform('size')
df.loc[ser_size < 7,'C'] = df.loc[ser_size < 7,'C'].map(lambda x: 46 + special_counters[x])
df.loc[ser_size >= 7,'C'] %= 6
At the end, you get, as expected:
print (df)
group C
color
red 1 0
blue 1 1
yellow 1 2
orange 1 3
green 1 4
white 1 5
black 1 0
brown 1 1
orange-red 1 2
teal 1 3
beige 1 4
mauve 2 46
cyan 2 47
goldenrod 2 45
auburn 2 48
azure 4 46
celadon 4 47
lavender 5 46
oak 6 46
chocolate 7 46
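Note that special_counters in the map call above is the list defined in the question's own code; for the answer's snippet to run on its own, it needs to be in scope first:

special_counters = [0, 1, -1, 2, -2, 3, -3, 4, -4, 5, -5, 6, -6, 7, -7]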

Conditional length of a binary data series in Pandas

Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination [1] to indicate where consecutive runs are shorter than 2 (or whatever limiting value you like). Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
            .astype(int)  # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
[1] See the pandas cookbook; the section on grouping, "Grouping like Python’s itertools.groupby".
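To see what the grouper in that expression does, the same logic can be broken into named steps (a sketch; the intermediate names are only for illustration):

run_id = (df.A != df.A.shift()).cumsum()        # a new id every time the value changes
pos_in_run = df.groupby(run_id).cumcount()      # 0-based position within each run
df['B'] = ((pos_in_run <= 1) & df.A.astype(bool)).astype(int)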
Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = map(lambda x: df['A'][x] if x < limit else int(not all(y == 1 for y in df['A'][x - limit:x])), range(len(df)))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array)
a = df['A'].as_matrix()
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
    a = numpy.asarray(array)
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
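A quick check of trim_runs against the example column (a sketch; numpy.asarray can return a view of its input, so a copy is passed here to leave the original column untouched):

import numpy
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
df['B'] = trim_runs(df['A'].values.copy(), 2)
# df['B'] -> [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]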

Using the .loc function of the pandas dataframe

I have a pandas dataframe, one of whose columns is:
a = [1,0,1,0,1,3,4,6,4,6]
Now I want to create another column such that any value greater than 0 and less than 5 is assigned 1 and the rest is assigned 0, i.e.:
a = [1,0,1,0,1,3,4,6,4,6]
b = [1,0,1,0,1,1,1,0,1,0]
Now I have done this:
dtaframe['b'] = dtaframe['a'].loc[0 < dtaframe['a'] < 5] = 1
dtaframe['b'] = dtaframe['a'].loc[dtaframe['a'] >4 or dtaframe['a']==0] = 0
but the code throws an error. What should I do?
You can use between to get Boolean values, then astype to convert from Boolean values to 0/1:
dtaframe['b'] = dtaframe['a'].between(0, 5, inclusive=False).astype(int)
The resulting output:
a b
0 1 1
1 0 0
2 1 1
3 0 0
4 1 1
5 3 1
6 4 1
7 6 0
8 4 1
9 6 0
Edit
For multiple ranges, you could use pandas.cut:
dtaframe['b'] = pd.cut(dtaframe['a'], bins=[0,1,6,9], labels=False, include_lowest=True)
You'll need to be careful about how you define bins. Using labels=False will return integer indicators for each bin, which happens to correspond with the labels you provided. You could also manually specify the labels for each bin, e.g. labels=[0,1,2], labels=[0,17,19], labels=['a','b','c'], etc. You may need to use astype if you manually specify the labels, as they'll be returned as categories.
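For instance, with manually specified labels (an illustrative sketch using the same bin edges as above; pd.cut returns a categorical, hence the astype):

dtaframe['b'] = pd.cut(dtaframe['a'], bins=[0, 1, 6, 9],
                       labels=[0, 1, 2], include_lowest=True).astype(int)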
Alternatively, you could combine loc and between to manually specify each range:
dtaframe.loc[dtaframe['a'].between(0,1), 'b'] = 0
dtaframe.loc[dtaframe['a'].between(2,6), 'b'] = 1
dtaframe.loc[dtaframe['a'].between(7,9), 'b'] = 2
When using comparison operators and boolean logic to filter dataframes, you can't use the pythonic idiom of a < myseries < b. Instead you need to write (a < myseries) & (myseries < b):
cond1 = (0 < dtaframe['a'])
cond2 = (dtaframe['a'] <= 5)
dtaframe['b'] = (cond1 & cond2) * 1
Try this with np.where:
dtaframe['b'] = np.where((dtaframe['a'] > 4) | (dtaframe['a'] == 0), 0, 1)

create a summary of movements between prices by date in a pandas dataframe

I have a dataframe which shows: 1) dates, 2) prices and 3) the difference between two prices by row.
dates | data | result | change
24-09 24 0 none
25-09 26 2 pos
26-09 27 1 pos
27-09 28 1 pos
28-09 26 -2 neg
I want to create a summary of the above data in a new dataframe. The summary would have 4 columns: 1) start date, 2) end date 3) number of days 4) run
For example, using the above, there was a positive run of +4 from 25-09 to 27-09, so I would want this in a row of a dataframe like so:
In the new dataframe there would be one new row for every change in the value of result from positive to negative. Where run = 0, this indicates no change from the previous day's price and would also need its own row in the dataframe.
start date | end date | num days | run
25-09 27-09 3 4
27-09 28-09 1 -2
23-09 24-09 1 0
The first step, I think, would be to create a new column "change" based on the value of run, which then shows either "positive", "negative" or "no change". Then maybe I could groupby this column.
A couple of useful functions for this style of problem are diff() and cumsum().
I added some extra datapoints to your sample data to flesh out the functionality.
The ability to pick and choose different (and more than one) aggregation functions assigned to different columns is a super feature of pandas.
df = pd.DataFrame({'dates': ['24-09', '25-09', '26-09', '27-09', '28-09', '29-09', '30-09', '01-10', '02-10', '03-10', '04-10'],
                   'data': [24, 26, 27, 28, 26, 25, 30, 30, 30, 28, 25],
                   'result': [0, 2, 1, 1, -2, 0, 5, 0, 0, -2, -3]})

def cat(x):
    return 1 if x > 0 else -1 if x < 0 else 0

df['cat'] = df['result'].map(lambda x: cat(x))  # probably there is a better way to do this
df['change'] = df['cat'].diff()
df['change_flag'] = df['change'].map(lambda x: 1 if x != 0 else x)
df['change_cum_sum'] = df['change_flag'].cumsum()  # which gives us our groupings
foo = df.groupby(['change_cum_sum']).agg({'result': np.sum, 'dates': [np.min, np.max, 'count']})
foo.reset_index(inplace=True)
foo.columns = ['id', 'start date', 'end date', 'num days', 'run']
print foo
which yields:
id start date end date num days run
0 1 24-09 24-09 1 0
1 2 25-09 27-09 3 4
2 3 28-09 28-09 1 -2
3 4 29-09 29-09 1 0
4 5 30-09 30-09 1 5
5 6 01-10 02-10 2 0
6 7 03-10 04-10 2 -5
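As an aside, the cat helper above (which the code comment flags as improvable) could be replaced with numpy's sign function, which maps values to -1, 0 or 1 directly:

df['cat'] = np.sign(df['result'])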
