Appending data frames in Pandas - python

I have a for loop that calls a function y on each iteration. The function returns a 5-column by 10-row dataframe called phstab.
for j in cycles:
    phstab = y(j)
The last column in the dataframe is the only one that changes; its value is the value of cycles for that iteration. All the values in the other columns stay the same on each iteration. So if the loop iterates four times, for example, it produces four separate instances of phstab, each with a different value of cycles.
I'd like to append phstab on each iteration so the output is one long dataframe instead of four instances. I tried inserting the following statement in the loop, but it didn't work:
phstab = phstab.append(phstab)
How do I get one single dataframe instead of four separate instances ?

I'm assuming your y(j) returns something like this:
In [35]: def y(j):
    ...:     return pd.DataFrame({'a': range(10),
    ...:                          'b': range(10),
    ...:                          'c': range(10),
    ...:                          'd': range(10),
    ...:                          'e_cycle': j})
To iterate over this function, adding a column on each iteration, I'd do something like this. On the first pass, the result is just assigned to phstab. On each subsequent iteration, a new column is added to phstab based on the result of y(j).
I'm assuming you need to rename columns; if y(j) already returns a unique column name based on the value of j, you'll have to modify this to fit.
In [36]: cycles = range(5)

In [37]: for i, j in enumerate(cycles):
    ...:     if i == 0:
    ...:         phstab = y(j)
    ...:         phstab = phstab.rename(columns={'e_cycle': 'e_' + str(j)})
    ...:     else:
    ...:         phstab['e_' + str(j)] = y(j)['e_cycle']
In [38]: phstab
Out[38]:
a b c d e_0 e_1 e_2 e_3 e_4
0 0 0 0 0 0 1 2 3 4
1 1 1 1 1 0 1 2 3 4
2 2 2 2 2 0 1 2 3 4
3 3 3 3 3 0 1 2 3 4
4 4 4 4 4 0 1 2 3 4
5 5 5 5 5 0 1 2 3 4
6 6 6 6 6 0 1 2 3 4
7 7 7 7 7 0 1 2 3 4
8 8 8 8 8 0 1 2 3 4
9 9 9 9 9 0 1 2 3 4
[10 rows x 9 columns]
Edit:
Thanks for clarifying. To have the output in long format, you can use pd.concat, as below.
In [47]: pd.concat([y(j) for j in cycles], ignore_index=True)
Out[47]:
a b c d e_cycle
0 0 0 0 0 0
1 1 1 1 1 0
2 2 2 2 2 0
3 3 3 3 3 0
4 4 4 4 4 0
5 5 5 5 5 0
6 6 6 6 6 0
7 7 7 7 7 0
8 8 8 8 8 0
9 9 9 9 9 0
10 0 0 0 0 1
11 1 1 1 1 1
.....
[50 rows x 5 columns]

I believe a very simple solution is:
my_dataframes = []
for j in cycles:
    phstab = y(j)
    my_dataframes.append(phstab)

full_dataframe = pd.concat(my_dataframes, ignore_index=True)
Alternatively and more concisely (credit to @chrisb above):
full_dataframe = pd.concat([y(j) for j in cycles], ignore_index=True)
pd.concat merges a list of dataframes together vertically. Ignoring the index is important so that the merged version doesn't retain the indices of the individual dataframes - otherwise you might end up with an index of [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3] when instead you'd want [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15].
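Since the original y(j) isn't shown, here is a minimal runnable sketch of the long-format approach with a stand-in for it:

```python
import pandas as pd

# Stand-in for the y(j) described in the question: the same four
# columns every time, plus a last column carrying the cycle value.
def y(j):
    return pd.DataFrame({'a': range(10), 'b': range(10),
                         'c': range(10), 'd': range(10),
                         'e_cycle': j})

cycles = range(4)
full = pd.concat([y(j) for j in cycles], ignore_index=True)
print(full.shape)  # 4 cycles x 10 rows -> (40, 5)
```

With ignore_index=True, the result has a clean 0..39 index instead of four repeated 0..9 runs.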

Related

Create a pandas column counter that successively counts 4 lines and 6 lines

I have a pandas dataframe with multiple columns and I want to create a counter that successively counts 4 rows and 6 rows.
I would like it to look like the dataframe below:
index  counter
0      1
1      1
2      1
3      1
4      2
5      2
6      2
7      2
8      2
9      2
10     3
11     3
12     3
13     3
As you can see, the first 4 values in the counter column are 1, the next 6 values are 2, and then the next 4 values are 3.
Your question is a bit clearer after the edit. You can create an empty list and a counter variable, then iterate over the range of the number of rows in steps of 10, i.e. (4+6). At each iteration, append a run of 4 values of counter and a run of 6 values of counter+1 to the result list.
Finally, take the first df.shape[0] values from the result list (it may contain a few more values than df.shape[0]) and assign that slice to the new column df['counter'].
result = []
counter = 1
for i in range(0, df.shape[0], 10):
    result += [counter] * 4
    counter += 1
    result += [counter] * 6
    counter += 1

df['counter'] = result[:df.shape[0]]
# result: [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4]
OUTPUT:
index counter
0 0 1
1 1 1
2 2 1
3 3 1
4 4 2
5 5 2
6 6 2
7 7 2
8 8 2
9 9 2
10 10 3
11 11 3
12 12 3
13 13 3
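The same 4-then-6 pattern can also be built without an explicit loop, using numpy's repeat and tile. This is a sketch of that idea, assuming a 14-row dataframe like the one above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': range(14)})  # any dataframe with 14 rows

n = df.shape[0]
blocks = -(-n // 10)                     # number of 4+6 blocks, rounded up
counters = np.arange(1, 2 * blocks + 1)  # 1, 2, 3, 4, ...
# Repeat each counter value 4 or 6 times, alternating, then trim to n rows.
pattern = np.repeat(counters, np.tile([4, 6], blocks))
df['counter'] = pattern[:n]
```

As with the loop version, the generated pattern can be slightly longer than the dataframe, so it is trimmed with `[:n]` before assignment.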

Sort 'pandas.core.series.Series' so that largest value is in the centre

I have a Pandas Series that looks like this:
import pandas as pd
x = pd.Series([3, 1, 1])
print(x)
0 3
1 1
2 1
I would like to sort the output so that the largest value is in the center. Like this:
0 1
1 3
2 1
Do you have any ideas on how to do this for series of different lengths as well (all of them sorted in decreasing order)? The length of the series will always be odd.
Thank you very much!
Anna
First sort the values, then take every other element and join it with the remaining elements, reversed, using concat:
x = pd.Series([6, 4, 4, 2, 2, 1, 1])
x = x.sort_values()
print(pd.concat([x[::2], x[len(x) - 2:0:-2]]))
5 1
3 2
1 4
0 6
2 4
4 2
6 1
dtype: int64
x = pd.Series(range(7))
x = x.sort_values()
print(pd.concat([x[::2], x[len(x) - 2:0:-2]]))
0 0
2 2
4 4
6 6
5 5
3 3
1 1
dtype: int64
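The slicing trick above generalizes into a small helper. This is a sketch; `.iloc` is used so the slices are positional regardless of the index type:

```python
import pandas as pd

def center_largest(s):
    # Sort ascending, then put every other value on the left
    # and the remaining values, reversed, on the right, so the
    # largest value ends up in the middle.
    s = s.sort_values()
    return pd.concat([s.iloc[::2], s.iloc[len(s) - 2:0:-2]])

print(center_largest(pd.Series([3, 1, 1])).tolist())  # [1, 3, 1]
```

Applied to the question's example, pd.Series([3, 1, 1]) comes back as 1, 3, 1 with the largest value in the center.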

Pandas apply a function over groups with same size response

I am trying to duplicate this result from R in Python. The function I want to apply (np.diff) takes an input and returns an array of the same size. When I try to group I get an output the size of the number of groups, not the number of rows.
Example DataFrame:
df = pd.DataFrame({'sample':[1,1,1,1,1,2,2,2,2,2],'value':[1,2,3,4,5,1,3,2,4,3]})
If I apply diff to it I get close to the result I want, except at the group borders. The (-4) value is a problem.
x = np.diff([df.loc[:,'value']], 1, prepend=0)[0]
df.loc[:,'delta'] = x
sample value delta
0 1 1 1
1 1 2 1
2 1 3 1
3 1 4 1
4 1 5 1
5 2 1 -4
6 2 3 2
7 2 2 -1
8 2 4 2
9 2 3 -1
I think the answer is to use groupby and apply or transform but I cannot figure out the syntax. The closest I can get is:
df.groupby('sample').apply(lambda df: np.diff(df['value'], 1, prepend =0 ))
sample
1    [1, 1, 1, 1, 1]
2    [1, 2, -1, 2, -1]
dtype: object
Here it's possible to use DataFrameGroupBy.diff, replace the first missing value in each group with 1, and then convert the values to integers:
df['delta'] = df.groupby('sample')['value'].diff().fillna(1).astype(int)
print(df)
sample value delta
0 1 1 1
1 1 2 1
2 1 3 1
3 1 4 1
4 1 5 1
5 2 1 1
6 2 3 2
7 2 2 -1
8 2 4 2
9 2 3 -1
Your solution can be changed to use GroupBy.transform: specify the column to process after the groupby and operate on x inside the lambda function:
df['delta'] = df.groupby('sample')['value'].transform(lambda x: np.diff(x, 1, prepend=0))
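A runnable sketch comparing the two approaches on the question's data. Note that they agree here only because each group's first value happens to be 1, which is what fillna(1) hard-codes; the transform version computes the first delta against a prepended 0 instead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'sample': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value':  [1, 2, 3, 4, 5, 1, 3, 2, 4, 3]})

# diff + fillna: the first delta in each group is hard-coded to 1.
a = df.groupby('sample')['value'].diff().fillna(1).astype(int)

# transform + np.diff with prepend=0: the first delta equals the
# group's first value (np.diff's prepend needs numpy >= 1.16).
b = df.groupby('sample')['value'].transform(lambda x: np.diff(x, 1, prepend=0))
```

On data whose groups start at values other than 1, prefer the transform version (or fill the NaN from the group's first value rather than a constant).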

How to perform an IF statement for duplicate values within the same column

I have a DataFrame and want to find duplicate values within a column. If a duplicate is found, I want to create a new column whose value appends a zero for every repeated occurrence, while leaving the original value unchanged.
Original DataFrame:
Code1
1
2
3
4
5
1
2
1
1
New DataFrame:
Code1 Code2
1     1
2     2
3     3
4     4
5     5
1     10
2     20
1     100
1     1000
Use groupby and cumcount:
df.assign(counts=df.groupby("Code1").cumcount(),
          Code2=lambda x: x["Code1"] * 10 ** x["counts"]
          ).drop("counts", axis=1)
Code1 Code2
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 1 10
6 2 20
7 1 100
8 1 1000
There might be a solution using transform (I just don't have time right now to investigate). However, this version is really explicit about what is happening:
import pandas as pd

data = [1, 2, 3, 4, 5, 1, 2, 1, 1]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Code1'])

code2 = []
x = {}
for d in data:
    if d not in x:
        x[d] = d
    else:
        x[d] = x[d] * 10
    code2.append(x[d])

df['Code2'] = code2
print(df)
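For reference, the groupby/cumcount idea from the first answer can also be written as a one-liner, without the temporary counts column:

```python
import pandas as pd

df = pd.DataFrame({'Code1': [1, 2, 3, 4, 5, 1, 2, 1, 1]})

# cumcount() gives the 0-based occurrence number of each value within
# its group, so the k-th repeat of a value is multiplied by 10**k.
df['Code2'] = df['Code1'] * 10 ** df.groupby('Code1').cumcount()
```

This vectorized form avoids the explicit Python loop over the data.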

How to filter groupby for first N items

In Pandas, how can I modify groupby to only take the first N items in the group?
Example
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2],
'values': [1, 2, 3, 4, 5, 6, 7]})
>>> df
id values
0 1 1
1 1 2
2 1 3
3 2 4
4 2 5
5 2 6
6 2 7
Desired functionality
# This doesn't work, but I am trying to return the first two items per group.
>>> df.groupby('id').first(2)
id values
0 1 1
1 1 2
3 2 4
4 2 5
What I've tried
I can perform a groupby and iterate through the groups to take the index of the first n values, but there must be a simpler solution.
n = 2  # First two rows.
idx = [i for group in df.groupby('id').groups.values() for i in group[:n]]
>>> df.loc[idx]
id values
0 1 1
1 1 2
3 2 4
4 2 5
You can use head:
In [11]: df.groupby("id").head(2)
Out[11]:
id values
0 1 1
1 1 2
3 2 4
4 2 5
Note: in older versions this used to be equivalent to .apply(pd.DataFrame.head), but it's more efficient since 0.15 (?), as it now uses cumcount under the hood.
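A self-contained check of the head approach on the question's data; note that the original row index is preserved:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2],
                   'values': [1, 2, 3, 4, 5, 6, 7]})

# head(n) on a groupby keeps the first n rows of every group,
# returning them with their original index positions.
first_two = df.groupby('id').head(2)
print(first_two)
```

Because the index is preserved (0, 1, 3, 4 here), the result can be aligned back against the original dataframe if needed.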
