I am trying to append a DataFrame to an existing DataFrame in a loop. Currently, new_data has 4 values in each column. On each loop iteration I want to append df2, which has 3 values in each column, to new_data.
new_data = pd.DataFrame({"a": [1, 2, 3, 4],
                         "b": [5, 6, 7, 8]})
for i in range(5):
    df2 = pd.DataFrame({"a": [1, 2, 3],
                        "b": [5, 6, 7]})
    print(df2)
    new_data.append(df2)
The final result should have 19 values in each column, for example:
a b
----
1 5
2 6
3 7
4 8
1 5
2 6
3 7
1 5
2 6
3 7
1 5
2 6
3 7
1 5
2 6
3 7
1 5
2 6
3 7
But for some reason it's not working and I am confused. When I perform the operation without a loop, it works properly.
For example:
# Creating the first DataFrame using a dictionary
df1 = pd.DataFrame({"a":[1, 2, 3, 4],
                    "b":[5, 6, 7, 8]})
# Creating the second DataFrame using a dictionary
df2 = pd.DataFrame({"a":[1, 2, 3],
"b":[5, 6, 7]})
# Print df1
print(df1, "\n")
df1.append(df2)
I don't understand what the issue is. Please explain.
You need to assign the result back, since append does not modify the DataFrame in place:
df1 = df1.append(df2)
And even better, don't use append, which is deprecated (and removed in pandas 2.0); use concat instead:
df1 = pd.concat([df1, df2])
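Applied to the loop from the question, a minimal sketch could collect the frames in a list and concatenate once at the end, which is also faster than growing a DataFrame inside the loop:

```python
import pandas as pd

new_data = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

# Collect each iteration's frame, then concatenate once at the end.
frames = [new_data]
for i in range(5):
    df2 = pd.DataFrame({"a": [1, 2, 3], "b": [5, 6, 7]})
    frames.append(df2)

new_data = pd.concat(frames, ignore_index=True)  # 4 + 5*3 = 19 rows
```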
Instead of looping, you can use pd.concat to replicate your dataframe df2 the desired number of times, and only then join the two dataframes together:
replicate = 5
new_df2 = pd.concat([df2]*replicate)
pd.concat([new_data, new_df2], ignore_index=True)
Out[34]:
a b
0 1 5
1 2 6
2 3 7
3 4 8
4 1 5
5 2 6
6 3 7
7 1 5
8 2 6
9 3 7
10 1 5
11 2 6
12 3 7
13 1 5
14 2 6
15 3 7
16 1 5
17 2 6
18 3 7
I am trying to get a CSV output of the values in a given window of a rolling computation, but I am getting the error must be real number, not str.
It appears that the output must be of numeric type.
https://github.com/pandas-dev/pandas/issues/23002
df = pd.DataFrame({"a": range(1,10)})
df.head(10)
# Output
a
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
Tried:
to_csv = lambda x: x.to_csv(index=False)
# to_csv = lambda x: ",".join([str(d) for d in x])
df["running_csv"] = df.rolling(min_periods=1, window=3).apply(to_csv) # <= Causes Error
# Error:
# TypeError: must be real number, not str
Expected Output
a running_csv
0 1 1
1 2 1,2
2 3 1,2,3
3 4 2,3,4
4 5 3,4,5
5 6 4,5,6
6 7 5,6,7
7 8 6,7,8
8 9 7,8,9
Question: Is there any alternative way to get the CSV output shown above?
Something like this?
>>> df['running_csv'] = pd.Series(df.rolling(min_periods=1, window=3)).apply(lambda x:x.a.values)
>>> df
a running_csv
0 1 [1]
1 2 [1, 2]
2 3 [1, 2, 3]
3 4 [2, 3, 4]
4 5 [3, 4, 5]
5 6 [4, 5, 6]
6 7 [5, 6, 7]
7 8 [6, 7, 8]
8 9 [7, 8, 9]
From here, further processing should be easy enough.
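One way to do that further processing, assuming pandas 1.1 or later (where rolling objects are iterable), is a sketch that joins each window's values into a CSV string directly:

```python
import pandas as pd

df = pd.DataFrame({"a": range(1, 10)})

# Iterating over the rolling object yields each window as a DataFrame,
# so we can stringify column "a" and join its values with commas.
df["running_csv"] = [
    ",".join(w["a"].astype(str)) for w in df.rolling(window=3, min_periods=1)
]
```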
While it would be great to be able to do this using:
df['a'].astype(str).rolling(min_periods=1, window=3).apply(''.join)
as you mentioned, rolling currently does not work with strings.
Here is one way, using shifted copies of the column:
(pd.DataFrame({i: df['a'].astype(str).shift(i) for i in range(3)[::-1]})
.fillna('')
.apply(','.join, axis=1)
.str.strip(',')
)
output:
0 1
1 1,2
2 1,2,3
3 2,3,4
4 3,4,5
5 4,5,6
6 5,6,7
7 6,7,8
8 7,8,9
Building on the partial solution by @fsimonjetz, we can complete it to generate the CSV values, as follows:
df['running_csv'] = (pd.Series(df.rolling(min_periods=1, window=3))
.apply(lambda x: x['a'].astype(str).values)
.str.join(',')
)
or simplify it further so that it uses only vectorized string operations, as follows:
df['running_csv'] = (pd.Series(df['a'].astype(str).rolling(min_periods=1, window=3))
.str.join(','))
Now we have replaced all the (slow) .apply() calls and lambda functions with (fast) vectorized string methods.
Result:
print(df)
a running_csv
0 1 1
1 2 1,2
2 3 1,2,3
3 4 2,3,4
4 5 3,4,5
5 6 4,5,6
6 7 5,6,7
7 8 6,7,8
8 9 7,8,9
pd.rolling doesn't work if the output is not numeric; multiple issues about this exist on GitHub. However, it's possible to get the desired outcome:
to_str_list = lambda x: ','.join(x[x.notna()].astype(int).astype(str).values.tolist())
df['running_csv'] = pd.concat([df['a'].shift(2), df['a'].shift(1), df['a']], axis=1) \
    .apply(to_str_list, axis=1)
>>> df
a running_csv
0 1 1
1 2 1,2
2 3 1,2,3
3 4 2,3,4
4 5 3,4,5
5 6 4,5,6
6 7 5,6,7
7 8 6,7,8
8 9 7,8,9
I have this column of numbers, sorted by value:
import pandas as pd
# initialize list of lists
data = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['First'])
df.head(19)
First
0 1
1 1
2 1
3 1
4 2
5 2
6 3
7 3
8 3
9 3
10 3
11 3
12 4
13 4
14 5
15 6
16 6
17 6
I would like to add a column that increments down the rows while the value in 'First' is the same as in the previous row. When going down the rows and encountering a new value in 'First', the increment restarts at 1 (or zero).
Here is an example dataframe of the result I am looking for.
First Second
0 1 1
1 1 2
2 1 3
3 1 4
4 2 1
5 2 2
6 3 1
7 3 2
8 3 3
9 3 4
10 3 5
11 3 6
12 4 1
13 4 2
14 5 1
15 6 1
16 6 2
17 6 3
What I have tried so far
I tried extracting the column to a list and using a loop to create the new column, which can then be appended to the dataframe:
columnList = df['First'].tolist()
newColumn = []
old = -1
toAdd = 1
for item in columnList:
    if item == old:
        toAdd += 1
        newColumn.append(toAdd)
    else:
        toAdd = 1
        newColumn.append(toAdd)
    old = item
newColumn
[1, 2, 3, 4, 1, 2, 1, 2, 3, 4, 5, 6, 1, 2, 1, 1, 2, 3]
Is there a method that is more efficient computationally, or at least more programmatically elegant? Possibly done in pure pandas?
Pandas has a "groupby" operation that does almost exactly what you need, called cumcount. It starts each group from zero, whereas you want to start from one, so just add one and you'll get the result you want:
df.groupby('First').cumcount() + 1
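Applied to the example dataframe from the question, that one-liner produces the Second column directly:

```python
import pandas as pd

data = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6]
df = pd.DataFrame(data, columns=["First"])

# cumcount numbers the rows within each group starting at 0; adding 1 starts at 1.
df["Second"] = df.groupby("First").cumcount() + 1
```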
import pandas as pd
df = pd.DataFrame( {'A': [1,1,2,3,4,5,5,6,7,7,7,8]} )
dummy = df["A"]
print(dummy)
0 1
1 1
2 2
3 3
4 4
5 5
6 5
7 6
8 7
9 7
10 7
11 8
Name: A, dtype: int64
res = df.groupby(dummy)
print(res.first())
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5, 6, 7, 8]
Why does the last print result in an empty dataframe? I expect each group to be a slice of the original df, where each slice contains as many rows as there are duplicates of a given value in column "A". What am I missing?
My guess is that, by default, A is set as the index before the groupby aggregator (e.g. first) is applied. Therefore, df is essentially left with no columns before the actual first operation runs. If you had another column B:
df = pd.DataFrame( {'A': [1,1,2,3,4,5,5,6,7,7,7,8], 'B':range(12)} )
then you would see A as the index and the first values for B in each group with df.groupby(dummy).first():
B
A
1 0
2 2
3 3
4 4
5 5
6 7
7 8
8 11
On another note, if you pass as_index=False, groupby will not set A as the index and you get non-empty data:
df.groupby(dummy, as_index=False).first()
gives:
A
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
Or, you can groupby on a copy of the column:
df.groupby(dummy.copy()).first()
and you get:
A
A
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
By default, as_index is True, which means groupby will take the passed column, make it the index, and then group the other columns of the DataFrame accordingly. You need to pass as_index=False to get your desired result.
import pandas as pd
df = pd.DataFrame( {'A': [1,1,2,3,4,5,5,6,7,7,7,8]} )
dummy = df["A"]
print(dummy)
res = df.groupby(dummy,as_index=False)
print(res.first())
A
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
as_index : bool, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
I have an Excel file with different numbers, and I open it using pandas.
When I read and then print the xlsx file, I have something like this:
5 7 7
0 6 16 5
1 10 12 15
2 1 5 6
3 5 6 18
. . . .
. . . .
n . . n
All i need is to distribute them with different intervals according to their frequencies.
My code is:
import pandas as pd
excel_archive = pd.read_excel("file name")
print(excel_archive)
I think the Excel file has no header, so first add header=None to read_excel and then use DataFrame.stack with Series.value_counts:
excel_archive = pd.read_excel("file name", header=None)
s = excel_archive.stack().value_counts()
print (s)
5 4
6 3
7 2
15 1
12 1
10 1
18 1
1 1
16 1
dtype: int64
Your question is not very clear, but if you just need to count the number of occurrences you can try something like this:
#generate a dataframe
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 4], [7, 8, 9], [1, 5, 2], [7, 9, 9]]), columns=['a', 'b', 'c'])
#Flatten the array
df_flat=df.stack().reset_index(drop=True)
#Count the number of occurences
df_flat.groupby(df_flat).size()
This is the input:
a b c
0 1 2 3
1 4 5 4
2 7 8 9
3 1 5 2
4 7 9 9
And this is the output:
1 2
2 2
3 1
4 2
5 2
7 2
8 1
9 3
If you want instead to divide in some predefined intervals you can use pd.cut together with groupby:
#define intervals
intervals = pd.IntervalIndex.from_arrays([0,3,6],[3,6,9],closed='right')
#cut and groupby
df_flat.groupby(pd.cut(df_flat,intervals)).size()
and the result would be:
(0, 3] 5
(3, 6] 4
(6, 9] 6
I have a 'for' loop that calls a function y(j) on each iteration. The function returns a 5-column by ten-row dataframe called phstab:
for j in cycles:
    phstab = y(j)
The last column in the dataframe is the only one that changes; its value is the current value of cycles. All the values in the other columns stay the same on each iteration. So if the loop iterates four times, for example, it will produce four separate instances of phstab, each instance with a different value of cycles.
I'd like to append phstab on each iteration so the output is just one long dataframe instead of four instances. I tried inserting the following statement in the loop, but it didn't work:
phstab=phstab.append(phstab)
How do I get one single dataframe instead of four separate instances ?
I'm assuming your y(j) returns something like this:
In [35]: def y(j):
...: return pd.DataFrame({'a': range(10),
...: 'b': range(10),
...: 'c': range(10),
...: 'd': range(10),
...: 'e_cycle' : j})
To iterate over this function, adding a column for each iteration, I'd do something like the following. On the first pass, the dataframe is just set to phstab. On each subsequent iteration, a new column is added to phstab based on the result of y(j).
I'm assuming you need to rename columns, if y(j) returns a unique column based on the value of j, you'll have to modify to fit.
In [38]: cycles = range(5)
In [38]: for i,j in enumerate(cycles):
...: if i == 0:
...: phstab = y(j)
...: phstab = phstab.rename(columns = {'e_cycle' : 'e_' + str(j)})
...: else:
...: phstab['e_' + str(j)] = y(j)['e_cycle']
In [38]: phstab
Out[38]:
a b c d e_0 e_1 e_2 e_3 e_4
0 0 0 0 0 0 1 2 3 4
1 1 1 1 1 0 1 2 3 4
2 2 2 2 2 0 1 2 3 4
3 3 3 3 3 0 1 2 3 4
4 4 4 4 4 0 1 2 3 4
5 5 5 5 5 0 1 2 3 4
6 6 6 6 6 0 1 2 3 4
7 7 7 7 7 0 1 2 3 4
8 8 8 8 8 0 1 2 3 4
9 9 9 9 9 0 1 2 3 4
[10 rows x 9 columns]
Edit:
Thanks for clarifying. To get the output in long format, you can use pd.concat, as below.
In [47]: pd.concat([y(j) for j in cycles], ignore_index=True)
Out[47]:
a b c d e_cycle
0 0 0 0 0 0
1 1 1 1 1 0
2 2 2 2 2 0
3 3 3 3 3 0
4 4 4 4 4 0
5 5 5 5 5 0
6 6 6 6 6 0
7 7 7 7 7 0
8 8 8 8 8 0
9 9 9 9 9 0
10 0 0 0 0 1
11 1 1 1 1 1
.....
[50 rows x 5 columns]
I believe a very simple solution is
my_dataframes = []
for j in cycles:
phstab = y(j)
my_dataframes.append(phstab)
full_dataframe = pd.concat(my_dataframes)
Alternatively and more concisely (credit to @chrisb above):
full_dataframe = pd.concat([y(j) for j in cycles], ignore_index=True)
pd.concat merges a list of dataframes together vertically. Ignoring the index is important so that the merged version doesn't retain the indices of the individual dataframes - otherwise you might end up with an index of [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3] when instead you'd want [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15].
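A small sketch illustrating the difference ignore_index makes:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

# Without ignore_index, the original row labels are kept: 0, 1, 0, 1.
kept = pd.concat([a, b])

# With ignore_index, the result gets a fresh RangeIndex: 0, 1, 2, 3.
fresh = pd.concat([a, b], ignore_index=True)
```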