Frequency table from disorganized data with pandas - Python

I have an Excel file with different numbers, and I open it using pandas.
When I read and then print the .xlsx file, I get something like this:
    5   7   7
0   6  16   5
1  10  12  15
2   1   5   6
3   5   6  18
.   .   .   .
.   .   .   .
n   .   .   n
All I need is to distribute them into different intervals according to their frequencies.
My code is:
import pandas as pd
excel_archive = pd.read_excel("file name")
print(excel_archive)

I think the Excel file has no header, so first add header=None to read_excel and then use DataFrame.stack with Series.value_counts:
excel_archive = pd.read_excel("file name", header=None)
s = excel_archive.stack().value_counts()
print(s)
5     4
6     3
7     2
15    1
12    1
10    1
18    1
1     1
16    1
dtype: int64
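If you also need counts per interval rather than per exact value, Series.value_counts accepts a bins argument; a minimal sketch, where the bin edges are assumptions rather than something given in the question:
# count the stacked values per interval; the edges below are placeholders
s = excel_archive.stack().value_counts(bins=[0, 5, 10, 15, 20], sort=False)
print(s)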

Your question is not very clear, but if you just have to count the number of occurrences you can try something like this:
import numpy as np
import pandas as pd

# generate a dataframe
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 4], [7, 8, 9], [1, 5, 2], [7, 9, 9]]),
                  columns=['a', 'b', 'c'])
# flatten the array
df_flat = df.stack().reset_index(drop=True)
# count the number of occurrences
df_flat.groupby(df_flat).size()
This is the input:
   a  b  c
0  1  2  3
1  4  5  4
2  7  8  9
3  1  5  2
4  7  9  9
And this is the output:
1    2
2    2
3    1
4    2
5    2
7    2
8    1
9    3
If instead you want to divide the values into predefined intervals, you can use pd.cut together with groupby:
# define intervals
intervals = pd.IntervalIndex.from_arrays([0, 3, 6], [3, 6, 9], closed='right')
# cut and groupby
df_flat.groupby(pd.cut(df_flat, intervals)).size()
and the result would be:
(0, 3]    5
(3, 6]    4
(6, 9]    6
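To apply the same idea to the stacked Excel data from the question, a small sketch (the interval edges here are placeholders, not from the question):
s = excel_archive.stack()
# count the stacked values per assumed interval
print(s.groupby(pd.cut(s, bins=[0, 6, 12, 18])).size())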

Dataframe(pandas) append not working in a loop

I am trying to append a DataFrame to an existing DataFrame using a loop. Currently, new_data has 4 values in each column. I want to go through the loop and add the new data, which is df2 with 3 values in each column, every time the loop iterates.
new_data = pd.DataFrame({"a": [1, 2, 3, 4],
                         "b": [5, 6, 7, 8]})
for i in range(5):
    df2 = pd.DataFrame({"a": [1, 2, 3],
                        "b": [5, 6, 7]})
    print(df2)
    new_data.append(df2)
The final result should have 19 values in each column, for example:
a b
----
1 5
2 6
3 7
4 8
1 5
2 6
3 7
1 5
2 6
3 7
1 5
2 6
3 7
1 5
2 6
3 7
1 5
2 6
3 7
But for some reason it's not working and I am confused. When I try to perform the operation without a loop, it works properly.
For example:
# Creating the first DataFrame using a dictionary
df1 = pd.DataFrame({"a": [1, 2, 3, 4],
                    "b": [5, 6, 7, 8]})
# Creating the second DataFrame using a dictionary
df2 = pd.DataFrame({"a": [1, 2, 3],
                    "b": [5, 6, 7]})
# Print df1
print(df1, "\n")
df1.append(df2)
I don't understand what the issue is here. Please explain.
append does not modify the DataFrame in place; it returns a new one, so you need to assign the result:
df1 = df1.append(df2)
And even better, don't use append, which is deprecated (and removed in pandas 2.0); use concat instead:
df1 = pd.concat([df1, df2])
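Applied to the loop in the question, a minimal sketch of the fix: collect the pieces in a list and concatenate once at the end, which is also cheaper than growing a DataFrame inside the loop.
frames = [new_data]
for i in range(5):
    df2 = pd.DataFrame({"a": [1, 2, 3],
                        "b": [5, 6, 7]})
    frames.append(df2)  # plain list append, not DataFrame.append
new_data = pd.concat(frames, ignore_index=True)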
Instead of using a loop, you can use pd.concat to replicate your dataframe df2 the desired number of times. Only then do you join both dataframes together.
replicate = 5
new_df2 = pd.concat([df2]*replicate)
pd.concat([new_data, new_df2], ignore_index=True)
Out[34]:
a b
0 1 5
1 2 6
2 3 7
3 4 8
4 1 5
5 2 6
6 3 7
7 1 5
8 2 6
9 3 7
10 1 5
11 2 6
12 3 7
13 1 5
14 2 6
15 3 7
16 1 5
17 2 6
18 3 7

Get CSV values on a pandas rolling function

I am trying to get a CSV output of the values in a given window of a rolling method, but I am getting an error: must be real number, not str.
It appears that the output must be of numeric type.
https://github.com/pandas-dev/pandas/issues/23002
df = pd.DataFrame({"a": range(1,10)})
df.head(10)
# Output
a
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
Tried:
to_csv = lambda x: x.to_csv(index=False)
# to_csv = lambda x: ",".join([str(d) for d in x])
df["running_csv"] = df.rolling(min_periods=1, window=3).apply(to_csv) # <= Causes Error
# Error:
# TypeError: must be real number, not str
Expected Output
a running_csv
0 1 1
1 2 1,2
2 3 1,2,3
3 4 2,3,4
4 5 3,4,5
5 6 4,5,6
6 7 5,6,7
7 8 6,7,8
8 9 7,8,9
Question: Is there any alternative way to get the CSV output like shown above?
Something like this?
>>> df['running_csv'] = pd.Series(df.rolling(min_periods=1, window=3)).apply(lambda x:x.a.values)
>>> df
a running_csv
0 1 [1]
1 2 [1, 2]
2 3 [1, 2, 3]
3 4 [2, 3, 4]
4 5 [3, 4, 5]
5 6 [4, 5, 6]
6 7 [5, 6, 7]
7 8 [6, 7, 8]
8 9 [7, 8, 9]
From here, further processing should be easy enough.
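For example, to turn those arrays into the comma-separated strings asked for in the question, a small follow-up sketch on top of the result above:
# join each window's values into a CSV string
df['running_csv'] = df['running_csv'].apply(lambda a: ','.join(map(str, a)))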
While it would be great to be able to do this using:
df['a'].astype(str).rolling(min_periods=1, window=3).apply(''.join)
as you mentioned, rolling currently does not work with strings.
Here is one way:
(pd.DataFrame({i: df['a'].astype(str).shift(i) for i in range(3)[::-1]})
   .fillna('')
   .apply(','.join, axis=1)
   .str.strip(',')
)
output:
0 1
1 1,2
2 1,2,3
3 2,3,4
4 3,4,5
5 4,5,6
6 5,6,7
7 6,7,8
8 7,8,9
Riding on the incomplete/partial solution by @fsimonjetz, we can complete it to generate the CSV values, as follows:
df['running_csv'] = (pd.Series(df.rolling(min_periods=1, window=3))
                       .apply(lambda x: x['a'].astype(str).values)
                       .str.join(','))
or simplify it further, moving all the per-window work into pandas string methods, as follows:
df['running_csv'] = (pd.Series(df['a'].astype(str).rolling(min_periods=1, window=3))
                       .str.join(','))
Now all the (slow) explicit .apply() and lambda calls have been replaced by pandas string methods.
Result:
print(df)
a running_csv
0 1 1
1 2 1,2
2 3 1,2,3
3 4 2,3,4
4 5 3,4,5
5 6 4,5,6
6 7 5,6,7
7 8 6,7,8
8 9 7,8,9
Rolling.apply doesn't work if the output is not numeric; multiple issues exist on GitHub. However, it's possible to get the outcome:
to_str_list = lambda x: ','.join(x[x.notna()].astype(int).astype(str).values.tolist())
df['running_csv'] = pd.concat([df['a'].shift(2), df['a'].shift(1), df['a']], axis=1) \
                      .apply(to_str_list, axis=1)
>>> df
a running_csv
0 1 1
1 2 1,2
2 3 1,2,3
3 4 2,3,4
4 5 3,4,5
5 6 4,5,6
6 7 5,6,7
7 8 6,7,8
8 9 7,8,9

In terms of lowest computational complexity, how to create a new pandas column which increments by a certain number based on another column

I have this column of numbers, sorted by value
import pandas as pd
# initialize list of lists
data = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['First'])
df.head(19)
First
0 1
1 1
2 1
3 1
4 2
5 2
6 3
7 3
8 3
9 3
10 3
11 3
12 4
13 4
14 5
15 6
16 6
17 6
I would like to add a column that increments down the rows as long as the value in 'First' is the same as in the previous row. When going down the rows and encountering a new value in 'First', the increment restarts at 1 (or zero).
Here is an example dataframe of the result I am looking for.
First Second
0 1 1
1 1 2
2 1 3
3 1 4
4 2 1
5 2 2
6 3 1
7 3 2
8 3 3
9 3 4
10 3 5
11 3 6
12 4 1
13 4 2
14 5 1
15 6 1
16 6 2
17 6 3
What I have tried so far
I tried extracting the column to a list and using a loop to create the new column, which can then be attached to the dataframe:
columnList = df['First'].tolist()
newColumn = []
old = -1
toAdd = 1
for item in columnList:
    if item == old:
        toAdd += 1
        newColumn.append(toAdd)
    else:
        toAdd = 1
        newColumn.append(toAdd)
    old = item
newColumn
[1, 2, 3, 4, 1, 2, 1, 2, 3, 4, 5, 6, 1, 2, 1, 1, 2, 3]
Is there a method that is more efficient computationally, or at least more programmatically elegant? Possibly done in pure pandas?
Pandas has a "groupby" operation that does almost exactly what you need, called cumcount. It starts each group from zero, whereas you want to start from one, so just add one and you'll get the result you want:
df.groupby('First').cumcount() + 1
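This works here because 'First' is sorted, so each value forms one contiguous block. If the data were not sorted and you wanted the counter to restart on every consecutive run instead, a sketch using a run id built with shift/cumsum:
# a new run starts wherever the value differs from the previous row
run_id = df['First'].ne(df['First'].shift()).cumsum()
df['Second'] = df.groupby(run_id).cumcount() + 1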

Appending data frames in Pandas

I have a 'for' loop that calls a function y on each iteration. The function returns a 5-column by 10-row dataframe called phstab.
for j in cycles:
    phstab = y(j)
The last column in the dataframe is the only one that changes; its value is the value of cycles. All the values in the other columns stay the same on each iteration. So if the loop iterates four times, for example, it will produce four separate instances of phstab, each with a different value of cycles.
I'd like to append phstab on each iteration so that the output is just one long dataframe instead of four instances. I tried inserting the following statement in the loop, but it didn't work:
phstab=phstab.append(phstab)
How do I get one single dataframe instead of four separate instances ?
I'm assuming your y(j) returns something like this:
In [35]: def y(j):
    ...:     return pd.DataFrame({'a': range(10),
    ...:                          'b': range(10),
    ...:                          'c': range(10),
    ...:                          'd': range(10),
    ...:                          'e_cycle': j})
To iterate over this function, adding a column for each iteration, I'd do something like this. On the first pass, the dataframe is just set to phstab. On each subsequent iteration, a new column is added to phstab based on the result of y(j).
I'm assuming you need to rename columns; if y(j) returns a unique column based on the value of j, you'll have to modify this to fit.
In [36]: cycles = range(5)

In [37]: for i, j in enumerate(cycles):
    ...:     if i == 0:
    ...:         phstab = y(j)
    ...:         phstab = phstab.rename(columns={'e_cycle': 'e_' + str(j)})
    ...:     else:
    ...:         phstab['e_' + str(j)] = y(j)['e_cycle']
In [38]: phstab
Out[38]:
a b c d e_0 e_1 e_2 e_3 e_4
0 0 0 0 0 0 1 2 3 4
1 1 1 1 1 0 1 2 3 4
2 2 2 2 2 0 1 2 3 4
3 3 3 3 3 0 1 2 3 4
4 4 4 4 4 0 1 2 3 4
5 5 5 5 5 0 1 2 3 4
6 6 6 6 6 0 1 2 3 4
7 7 7 7 7 0 1 2 3 4
8 8 8 8 8 0 1 2 3 4
9 9 9 9 9 0 1 2 3 4
[10 rows x 9 columns]
Edit:
Thanks for clarifying. To have the output in long format, you can use pd.concat, as below.
In [47]: pd.concat([y(j) for j in cycles], ignore_index=True)
Out[47]:
a b c d e_cycle
0 0 0 0 0 0
1 1 1 1 1 0
2 2 2 2 2 0
3 3 3 3 3 0
4 4 4 4 4 0
5 5 5 5 5 0
6 6 6 6 6 0
7 7 7 7 7 0
8 8 8 8 8 0
9 9 9 9 9 0
10 0 0 0 0 1
11 1 1 1 1 1
.....
[50 rows x 5 columns]
I believe a very simple solution is
my_dataframes = []
for j in cycles:
    phstab = y(j)
    my_dataframes.append(phstab)
full_dataframe = pd.concat(my_dataframes)
Alternatively, and more concisely (credit to @chrisb above):
full_dataframe = pd.concat([y(j) for j in cycles], ignore_index=True)
pd.concat merges a list of dataframes together vertically. Ignoring the index is important so that the merged version doesn't retain the indices of the individual dataframes - otherwise you might end up with an index of [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3] when instead you'd want [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15].
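A minimal demonstration of the difference, with made-up data:
part = pd.DataFrame({'x': [10, 20]})
print(pd.concat([part, part]).index.tolist())                     # [0, 1, 0, 1]
print(pd.concat([part, part], ignore_index=True).index.tolist())  # [0, 1, 2, 3]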

Key error on idxmax in df

I have a pandas grouped dataframe that I created with:
Prof_file = prof_claims.groupby(['TC_Code', 'Primary_CPT_Description'])
grp_prof = Prof_file['Total_Case_AMT'].agg([np.sum, np.mean, np.count_nonzero])
Now I want to find the longest string in the field 'Primary_CPT_Description'.
I am using
grp_prof.ix[grp_prof['Primary_CPT_Description'].idxmax()]
I have also tried
print grp_prof.groupby(['Primary_CPT_Description']).idxmax()
However I keep getting an error: KeyError: u'no item named Primary_CPT_Description'
That does not seem to make sense as I definitely have 'Primary_CPT_Description' as a string field in the df.
Your grp_prof probably looks something like this (manufactured data, but you get the idea):
>>> grp_prof
sum mean count_nonzero
TC_Code Primary_CPT_Description
0 5 15 15 1
1 6 16 16 1
2 7 17 17 1
3 8 18 18 1
4 9 19 19 1
See how TC_Code and Primary_CPT_Description are printed on a line below sum, mean, and count_nonzero? They're not columns, they're part of the index:
>>> grp_prof.columns
Index([u'sum', u'mean', u'count_nonzero'], dtype='object')
>>> grp_prof.index
MultiIndex(levels=[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]],
labels=[[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]],
names=[u'TC_Code', u'Primary_CPT_Description'])
I'd probably use .reset_index():
>>> grp_prof = grp_prof.reset_index()
>>> grp_prof
TC_Code Primary_CPT_Description sum mean count_nonzero
0 0 5 15 15 1
1 1 6 16 16 1
2 2 7 17 17 1
3 3 8 18 18 1
4 4 9 19 19 1
>>> grp_prof["Primary_CPT_Description"].idxmax()
4
To get the longest string, you need to get a Series of lengths first (new fake data):
>>> df["Primary_CPT_Description"]
0 A
1 BC
2 CDE
3 F
4 GH
Name: Primary_CPT_Description, dtype: object
>>> df["Primary_CPT_Description"].apply(len)
0 1
1 2
2 3
3 1
4 2
Name: Primary_CPT_Description, dtype: int64
>>> df["Primary_CPT_Description"].apply(len).idxmax()
2
>>> df["Primary_CPT_Description"].str.len().idxmax()
2
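Putting it together, a sketch that pulls out the longest description itself, assuming the reset-index frame where Primary_CPT_Description holds the actual description strings:
# index label of the longest string, then the string at that label
longest_idx = grp_prof['Primary_CPT_Description'].str.len().idxmax()
longest = grp_prof.loc[longest_idx, 'Primary_CPT_Description']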
