I am wondering whether there is a more efficient way to create a dataframe from a combinations object in Python. For example:
I have a list:
lst = [1, 2, 3, 4, 5] and pick_size = 3
combos_obj = itertools.combinations(lst, pick_size)
So far I have been using a list comprehension:
combos = [list(i) for i in combos_obj]
df = pd.DataFrame(combos)
which gives me the result I want, but is there a more efficient way to do this?
Just create a dataframe directly from the iterator that itertools.combinations returns:
>>> combos_obj = itertools.combinations(lst, pick_size)
>>> combos_obj
<itertools.combinations at 0x12ba9d220>
>>> df = pd.DataFrame(combos_obj)
>>> df
0 1 2
0 1 2 3
1 1 2 4
2 1 2 5
3 1 3 4
4 1 3 5
5 1 4 5
6 2 3 4
7 2 3 5
8 2 4 5
9 3 4 5
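If you also want meaningful column names instead of the default 0, 1, 2, you can pass them in the same call. A minimal sketch (the pick_* names below are just an illustration, not part of the original question):

import itertools
import pandas as pd

lst = [1, 2, 3, 4, 5]
pick_size = 3

# Build the frame straight from the iterator and name the columns in one step.
df = pd.DataFrame(
    itertools.combinations(lst, pick_size),
    columns=[f"pick_{i}" for i in range(pick_size)],
)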
I have a list containing numbers: x = (1,2,3,4,5,6,7,8)
I also have a DataFrame with 1000+ rows.
I need to assign the numbers in the list to a new column, so that rows 1-8 contain the numbers 1-8, but after that it starts again: row 9 should contain 1, and so on.
It seems really easy, but somehow I cannot manage to do this.
Here are two possible ways (example here with 3 items to repeat):
with numpy.tile
df = pd.DataFrame({'col': range(10)})
x = (1,2,3)
df['newcol'] = np.tile(x, len(df)//len(x)+1)[:len(df)]
with itertools
from itertools import cycle, islice
df = pd.DataFrame({'col': range(10)})
x = (1,2,3)
df['newcol'] = list(islice(cycle(x), len(df)))
input:
col
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
output:
col newcol
0 0 1
1 1 2
2 2 3
3 3 1
4 4 2
5 5 3
6 6 1
7 7 2
8 8 3
9 9 1
Another option: repeat the tuple enough times with math.ceil to cover the frame, then truncate:
from math import ceil
df['new_column'] = (x*(ceil(len(df)/len(x))))[:len(df)]
I am trying to get a CSV output of values for a given window in a rolling method, but I am getting an error: must be real number, not str.
It appears that the output must be of numeric type.
https://github.com/pandas-dev/pandas/issues/23002
df = pd.DataFrame({"a": range(1,10)})
df.head(10)
# Output
a
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
Tried:
to_csv = lambda x: x.to_csv(index=False)
# to_csv = lambda x: ",".join([str(d) for d in x])
df["running_csv"] = df.rolling(min_periods=1, window=3).apply(to_csv) # <= Causes Error
# Error:
# TypeError: must be real number, not str
Expected Output
a running_csv
0 1 1
1 2 1,2
2 3 1,2,3
3 4 2,3,4
4 5 3,4,5
5 6 4,5,6
6 7 5,6,7
7 8 6,7,8
8 9 7,8,9
Question: Is there any alternative way to get the CSV output like shown above?
Something like this?
>>> df['running_csv'] = pd.Series(df.rolling(min_periods=1, window=3)).apply(lambda x:x.a.values)
>>> df
a running_csv
0 1 [1]
1 2 [1, 2]
2 3 [1, 2, 3]
3 4 [2, 3, 4]
4 5 [3, 4, 5]
5 6 [4, 5, 6]
6 7 [5, 6, 7]
7 8 [6, 7, 8]
8 9 [7, 8, 9]
From here, further processing should be easy enough.
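For example, one way (a sketch, not part of the original answer) to turn those arrays into the comma-separated strings the question asks for:

df['running_csv'] = df['running_csv'].apply(lambda arr: ','.join(map(str, arr)))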
While it would be great to be able to do this using:
df['a'].astype(str).rolling(min_periods=1, window=3).apply(''.join)
as you mentioned, rolling currently does not work with strings.
Here is one way:
(pd.DataFrame({i: df['a'].astype(str).shift(i) for i in range(3)[::-1]})
.fillna('')
.apply(','.join, axis=1)
.str.strip(',')
)
output:
0 1
1 1,2
2 1,2,3
3 2,3,4
4 3,4,5
5 4,5,6
6 5,6,7
7 6,7,8
8 7,8,9
Building on the incomplete/partial solution by @fsimonjetz, we can complete it to generate the CSV values as follows:
df['running_csv'] = (pd.Series(df.rolling(min_periods=1, window=3))
.apply(lambda x: x['a'].astype(str).values)
.str.join(',')
)
or simplify it further so that it uses only vectorized string methods:
df['running_csv'] = (pd.Series(df['a'].astype(str).rolling(min_periods=1, window=3))
.str.join(','))
Now all the (slow) .apply() and lambda calls have been replaced with (fast) vectorized functions.
Result:
print(df)
a running_csv
0 1 1
1 2 1,2
2 3 1,2,3
3 4 2,3,4
4 5 3,4,5
5 6 4,5,6
6 7 5,6,7
7 8 6,7,8
8 9 7,8,9
rolling doesn't work if the output is not numeric; multiple issues exist on GitHub. However, it's possible to get the desired outcome:
to_str_list = lambda x: ','.join(x[x.notna()].astype(int).astype(str).values.tolist())
df['running_csv'] = pd.concat([df['a'].shift(2), df['a'].shift(1), df['a']], axis=1) \
                      .apply(to_str_list, axis=1)
>>> df
a running_csv
0 1 1
1 2 1,2
2 3 1,2,3
3 4 2,3,4
4 5 3,4,5
5 6 4,5,6
6 7 5,6,7
7 8 6,7,8
8 9 7,8,9
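For reference, the same shift-and-concat idea can be written for an arbitrary window size. A sketch under the assumption that window is a variable you define (it is not part of the original answer):

window = 3
# One shifted copy of the column per position in the window, oldest first.
shifted = pd.concat([df['a'].shift(i) for i in range(window - 1, -1, -1)], axis=1)
df['running_csv'] = shifted.apply(
    lambda row: ','.join(row.dropna().astype(int).astype(str)), axis=1
)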
For some reason, I can't think of the best solution for this problem. I am looking for a solution that does not involve iterating.
Let's say we have the following df:
df = pd.DataFrame([[3,5,6,2,3],[3,5,7,3,5],[5,5,3,5,4],[2,3,4,5,6]])
0 1 2 3 4
0 3 5 6 2 3
1 3 5 7 3 5
2 5 5 3 5 4
3 2 3 4 5 6
What would be the best way to find the rows with no duplicate numbers? Only the last row satisfies the requirement. I have come up with the solution below, but I feel like I am missing a more obvious answer.
df.rank(axis=1,method='dense').eq(len(df.columns)).any(axis=1)
0 False
1 False
2 False
3 True
dtype: bool
Is there a better way to find duplicates across rows?
As a bonus, what would be the best way to make a list that shows the numbers that were duplicated?
My solution works, but I feel like I am forgetting a much better way:
df.apply(lambda x: x.value_counts().loc[x.value_counts().gt(1)].index.tolist(),axis=1)
0 1 2 3 4 dups
0 3 5 6 2 3 [3]
1 3 5 7 3 5 [3, 5]
2 5 5 3 5 4 [5]
3 2 3 4 5 6 []
Let us try the numpy way:
def row_dup(a):
    v, c = np.unique(a, return_counts=True)
    return list(v[np.where(c > 1)])

df.apply(row_dup, axis=1)
Out[222]:
0 [3]
1 [3, 5]
2 [5]
3 []
dtype: object
Let's try nunique on axis=1 to count the unique values per row and compare to the width of the DataFrame:
df.nunique(axis=1).eq(df.shape[1])
0 False
1 False
2 False
3 True
dtype: bool
Edit: to answer the "bonus" question (added after my initial answer), use duplicated and unique:
df.apply(lambda s: s[s.duplicated()].unique(), axis=1)
Output:
0 [3]
1 [3, 5]
2 [5]
3 []
dtype: object
@Henry's answer is already good, but for the "bonus" question:
df["dups"] = df.apply(
lambda x: (v := x.value_counts())[v > 1].index.tolist(), 1
)
print(df)
Prints:
0 1 2 3 4 dups
0 3 5 6 2 3 [3]
1 3 5 7 3 5 [3, 5]
2 5 5 3 5 4 [5]
3 2 3 4 5 6 []
In Pandas, how can I modify groupby to only take the first N items in the group?
Example
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2],
'values': [1, 2, 3, 4, 5, 6, 7]})
>>> df
id values
0 1 1
1 1 2
2 1 3
3 2 4
4 2 5
5 2 6
6 2 7
Desired functionality
# This doesn't work, but I am trying to return the first two items per group.
>>> df.groupby('id').first(2)
id values
0 1 1
1 1 2
3 2 4
4 2 5
What I've tried
I can perform a groupby and iterate through the groups to take the index of the first n values, but there must be a simpler solution.
n = 2 # First two rows.
idx = [i for group in df.groupby('id').groups.values() for i in group[:n]]
>>> df.loc[idx]
id values
0 1 1
1 1 2
3 2 4
4 2 5
You can use head:
In [11]: df.groupby("id").head(2)
Out[11]:
id values
0 1 1
1 1 2
3 2 4
4 2 5
Note: in older versions this used to be equivalent to .apply(pd.DataFrame.head), but it's more efficient since 0.15 (?) because it now uses cumcount under the hood.
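For reference, a roughly equivalent filter can be written explicitly with cumcount. A sketch (n is just the illustrative group size from the question):

n = 2
df[df.groupby("id").cumcount() < n]

Unlike head, this form gives you a boolean mask that can be combined with other row-level conditions.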
I have a 'for' loop that calls a function (y) on each iteration. The function returns a 5-column by 10-row dataframe called phstab.
for j in cycles:
    phstab = y(j)
The last column in the dataframe is the only one that changes; its value is the value of cycles. All the other columns stay the same on each iteration. So if the loop iterates four times, for example, it will produce four separate instances of phstab, each with a different value of cycles.
I'd like to append phstab on each iteration so the output is just one long dataframe instead of four instances. I tried inserting the following statement in the loop, but it didn't work:
phstab=phstab.append(phstab)
How do I get one single dataframe instead of four separate instances ?
I'm assuming your y(j) returns something like this:
In [35]: def y(j):
...: return pd.DataFrame({'a': range(10),
...: 'b': range(10),
...: 'c': range(10),
...: 'd': range(10),
...: 'e_cycle' : j})
To iterate over this function, adding columns on each iteration, I'd do something like this. On the first pass, the dataframe is just set to phstab. On each subsequent iteration, a new column is added to phstab based on the results of y(j).
I'm assuming you need to rename columns; if y(j) returns a unique column based on the value of j, you'll have to modify this to fit.
In [38]: cycles = range(5)
In [38]: for i,j in enumerate(cycles):
...: if i == 0:
...: phstab = y(j)
...: phstab = phstab.rename(columns = {'e_cycle' : 'e_' + str(j)})
...: else:
...: phstab['e_' + str(j)] = y(j)['e_cycle']
In [38]: phstab
Out[38]:
a b c d e_0 e_1 e_2 e_3 e_4
0 0 0 0 0 0 1 2 3 4
1 1 1 1 1 0 1 2 3 4
2 2 2 2 2 0 1 2 3 4
3 3 3 3 3 0 1 2 3 4
4 4 4 4 4 0 1 2 3 4
5 5 5 5 5 0 1 2 3 4
6 6 6 6 6 0 1 2 3 4
7 7 7 7 7 0 1 2 3 4
8 8 8 8 8 0 1 2 3 4
9 9 9 9 9 0 1 2 3 4
[10 rows x 9 columns]
Edit:
Thanks for clarifying. To have the output in long format, you can use pd.concat, as below.
In [47]: pd.concat([y(j) for j in cycles], ignore_index=True)
Out[47]:
a b c d e_cycle
0 0 0 0 0 0
1 1 1 1 1 0
2 2 2 2 2 0
3 3 3 3 3 0
4 4 4 4 4 0
5 5 5 5 5 0
6 6 6 6 6 0
7 7 7 7 7 0
8 8 8 8 8 0
9 9 9 9 9 0
10 0 0 0 0 1
11 1 1 1 1 1
.....
[50 rows x 5 columns]
I believe a very simple solution is:
my_dataframes = []
for j in cycles:
    phstab = y(j)
    my_dataframes.append(phstab)
full_dataframe = pd.concat(my_dataframes)
Alternatively and more concisely (credit to @chrisb above):
full_dataframe = pd.concat([y(j) for j in cycles], ignore_index=True)
pd.concat merges a list of dataframes together vertically. Ignoring the index is important so that the merged version doesn't retain the indices of the individual dataframes - otherwise you might end up with an index of [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3] when instead you'd want [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15].
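A tiny, self-contained illustration of the difference (toy frames, not from the original question):

import pandas as pd

parts = [pd.DataFrame({'x': range(3)}) for _ in range(2)]

pd.concat(parts).index.tolist()                     # [0, 1, 2, 0, 1, 2]
pd.concat(parts, ignore_index=True).index.tolist()  # [0, 1, 2, 3, 4, 5]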