I want to create a plot from a large Pandas dataframe. The data is in the following format
Type Number ...unimportant additional columns
Foo 13 ...
Foo 25 ...
Foo 56 ...
Foo 56 ...
Bar 10 ...
Bar 10 ...
Bar 11 ...
Bar 23 ...
I need to count the number of values from column 'Number' that fall in a sliding window from x to x+i, to determine how many values land in each sliding-window bucket.
For example, with window size i=10, starting at x=0 and incrementing x by 1 each step, a correct result for the above example would be:
Foo Bar
0 0 2 #(0-10)
1 0 3 #(1-11)
2 0 3 #(2-12)
3 1 3 #(3-13)
4 1 3 #(4-14)
.
.
.
13 1 1 #(13-23)
14 0 1 #(14-24)
15 1 1 #(15-25)
.
.
.
The result would have df['Number'].max() - [window length] rows and one column per unique value in 'Type'.
Toy code to generate a similar dataframe might be the following:
import pandas as pd
import numpy as np

str_arr = ['Foo', 'Bar', 'Python', 'PleaseHelp']
data1 = np.random.choice(str_arr, 100, p=[0.5, 0.1, 0.1, 0.3]).reshape(-1, 1)
data2 = np.random.randint(100, size=(100, 1))
merge = np.concatenate((data1, data2), axis=1)  # note: this upcasts everything to strings
df = pd.DataFrame(merge, index=range(100), columns=['Type', 'Number'])
df.sort_values(['Type', 'Number'], ascending=[True, True], inplace=True)
df = df.reset_index(drop=True)
How can I generate such a table efficiently?
Edit Note: Thanks to FLab, who answered an earlier version of this question before I clarified it.
Here is my proposed solution.
For convenience, let's force the 'Number' column to be an int (the concatenation above turned everything into strings).
df['Number'] = df['Number'].astype(int)
Define all possible ranges:
len_wdw = 10
all_ranges = [(i, i+len_wdw) for i in range(df['Number'].max()-len_wdw)]
And now check how many observations there are for 'Number' in each of these ranges:
def get_mask(df, rg):
    # rg is a (start, end) tuple, e.g. (10, 20); both ends inclusive
    return (df['Number'] >= rg[0]) & (df['Number'] <= rg[1])

result = pd.concat({rg[0]: df[get_mask(df, rg)].groupby('Type').count()['Number']
                    for rg in all_ranges},
                   axis=1).fillna(0).T
For the randomly generated numbers, this gives:
Bar Foo PleaseHelp Python
0 1.0 4.0 3.0 1.0
1 1.0 5.0 2.0 1.0
2 1.0 5.0 3.0 1.0
3 1.0 4.0 3.0 0.0
4 1.0 3.0 3.0 1.0
.....
85 2.0 3.0 4.0 1.0
86 1.0 3.0 3.0 1.0
87 1.0 4.0 3.0 1.0
88 1.0 4.0 4.0 1.0
89 1.0 3.0 5.0 1.0
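For larger frames, the per-range groupby can be replaced by a single crosstab plus a rolling sum over the value axis. A minimal sketch, assuming 'Number' has already been cast to int as above and keeping both window ends inclusive like the mask:

# occurrences of each integer value per Type, padded to the full value range
counts = pd.crosstab(df['Number'], df['Type'])
counts = counts.reindex(range(df['Number'].max() + 1), fill_value=0)

# each window covers len_wdw + 1 consecutive values (both ends inclusive);
# after dropna, row 0 corresponds to the window (0, len_wdw)
result = (counts.rolling(len_wdw + 1).sum()
                .dropna()
                .reset_index(drop=True)
                .astype(int))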
Related
I wrote a function that outputs 3 lists and want to make those lists each a column in a dataframe.
The function returns a tuple of 3 lists, containing text or lists of text.
Here is the function:
def function(pages=0):
    a = [title for title in range(pages)]
    b = [[summary] for summary in title.summary]
    c = [[summary2] for summary2 in title.summary2]
    return a, b, c
data = function(pages = 2)
pd.DataFrame(data, columns = ['A', 'B', 'C'])
and the error says that I passed data with 2 columns while 3 column names were given. Can someone explain what is going on and how to fix it? Thank you!
One way to address this is to transpose the output and then create the dataframe.
A sample example:
import pandas as pd
import numpy as np
def function(pages=0):
    # Replace this with your logic
    a = list(range(10))
    b = [i * 0.9 for i in a]
    c = [i * 0.5 for i in a]
    return [a, b, c]

data = np.array(function()).T.tolist()
df = pd.DataFrame(data=data, columns=['A', 'B', 'C'])
Output:
In [25]: df
Out[25]:
A B C
0 0.0 0.0 0.0
1 1.0 0.9 0.5
2 2.0 1.8 1.0
3 3.0 2.7 1.5
4 4.0 3.6 2.0
5 5.0 4.5 2.5
6 6.0 5.4 3.0
7 7.0 6.3 3.5
8 8.0 7.2 4.0
9 9.0 8.1 4.5
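A simpler alternative sketch, assuming the three lists have equal length, is to skip the numpy round-trip and build the frame from a dict of columns:

a, b, c = function()
# each dict key becomes a column; pandas aligns the lists by position
df = pd.DataFrame({'A': a, 'B': b, 'C': c})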
I have a simple pandas DataFrame where I need to add a new column showing the count of occurrences, across a range of other columns 'pricemonths', that match the 'Current_Price' column:
import pandas as pd
import numpy as np
# my data
data = {'Item': ['Bananas', 'Apples', 'Pears', 'Avocados', 'Grapes', 'Melons'],
        'Jan': [1, 0.5, 1.1, 0.6, 2, 4],
        'Feb': [0.9, 0.5, 1, 0.6, 2, 5],
        'Mar': [1, 0.6, 1, 0.6, 2.1, 6],
        'Apr': [1, 0.6, 1, 0.6, 2, 5],
        'May': [1, 0.5, 1.1, 0.6, 2, 5],
        'Current_Price': [1, 0.6, 1, 0.6, 2, 4]}

# create the DataFrame
df = pd.DataFrame(data)
pricemonths = ['Jan', 'Feb', 'Mar', 'Apr', 'May']
Thus, my final dataframe would contain another column ('times_found') with the values:
'times_found'
4
2
3
5
4
1
One way of doing it is to transpose the price columns of df, then use eq to compare with "Current_Price" across indices (which creates a boolean DataFrame with True for matching prices and False otherwise) and sum across rows:
df['times_found'] = df['Current_Price'].eq(df.loc[:,'Jan':'May'].T).sum(axis=0)
or use numpy broadcasting:
df['times_found'] = (df.loc[:,'Jan':'May'].to_numpy() == df[['Current_Price']].to_numpy()).sum(axis=1)
Excellent suggestion from @HenryEcker: DataFrame.eq along an axis may be faster than transposing for larger DataFrames:
df['times_found'] = df.loc[:, 'Jan':'May'].eq(df['Current_Price'], axis=0).sum(axis=1)
Output:
Item Jan Feb Mar Apr May Current_Price times_found
0 Bananas 1.0 0.9 1.0 1.0 1.0 1.0 4
1 Apples 0.5 0.5 0.6 0.6 0.5 0.6 2
2 Pears 1.1 1.0 1.0 1.0 1.1 1.0 3
3 Avocados 0.6 0.6 0.6 0.6 0.6 0.6 5
4 Grapes 2.0 2.0 2.1 2.0 2.0 2.0 4
5 Melons 4.0 5.0 6.0 5.0 5.0 4.0 1
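For intuition, you can print the boolean intermediate that the axis-wise eq produces; each True marks a month whose price equals Current_Price, and the row sums are exactly times_found:

# one boolean per (row, month) pair; summing across columns gives times_found
print(df.loc[:, 'Jan':'May'].eq(df['Current_Price'], axis=0))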
So I am working in Python trying to change the index of my dataframe.
Here is my code:
df = pd.read_csv("data_file.csv", na_values=' ')
table = df['HINCP'].groupby(df['HHT'])
print(table.describe()[['mean', 'std', 'count', 'min', 'max']].sort_values('mean', ascending=False))
Here is the dataframe currently:
mean std count min max
HHT
1.0 106790.565562 100888.917804 25495.0 -5100.0 1425000.0
5.0 79659.567376 74734.380152 1410.0 0.0 625000.0
7.0 69055.725901 63871.751863 1193.0 0.0 645000.0
2.0 64023.122122 59398.970193 1998.0 0.0 610000.0
3.0 49638.428821 48004.399101 5718.0 -5100.0 609000.0
4.0 48545.356298 60659.516163 5835.0 -5100.0 681000.0
6.0 37282.245015 44385.091076 8024.0 -11200.0 676000.0
I want the index values to be like this instead of the numbered 1,2,...,7:
Married couple household
Nonfamily household:Male
Nonfamily household:Female
Other family household:Male
Other family household:Female
Nonfamily household:Male
Nonfamily household:Female
I tried using set_index() as an attribute of table, setting the key equal to the list of index values above, but this gives me this error:
AttributeError: 'SeriesGroupBy' object has no attribute 'set_index'
I was also wondering if there is any way to alter the HHT label at the top of the index, or will that come with changing the index values?
>>> df = pd.DataFrame(columns = ["HHT", "HINC"], data = np.transpose([[2,3,2,2,2,3,3,3,4], [1,1,3,1,4,7,8,9,11]]))
>>> df
HHT HINC
0 2 1
1 3 1
2 2 3
3 2 1
4 2 4
5 3 7
6 3 8
7 3 9
8 4 11
>>> table = df['HINC'].groupby(df['HHT'])
>>> td = table.describe()
>>> df2 = pd.DataFrame(td)
>>> df2.index = ['lab1', 'lab2', 'lab3']
>>> df2
count mean std min 25% 50% 75% max
lab1 4.0 2.25 1.500000 1.0 1.0 2.0 3.25 4.0
lab2 4.0 6.25 3.593976 1.0 5.5 7.5 8.25 9.0
lab3 1.0 11.00 NaN 11.0 11.0 11.0 11.00 11.0
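As for the HHT label printed above the index: that is the index's name, and it can be changed independently of the values. A minimal sketch using the toy frame above (the new label is, of course, your choice):

# rename_axis replaces the index name ('HHT' in your real data)
df2 = df2.rename_axis('Household type')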
I have a dataframe,
foo column1 column2 ..... column9999
0 5 0.8 0.01
1 10 0.9 0.01
2 15 0.2 1.2
3 8 0.12 0.5
4 74 0.78 0.7
. ... ...
Based on these existing columns, I want to create new columns.
If I go one by one, it would look like this:
df["A1"] = df.foo[df["column1"] > 0.1].rank(ascending=False)
df.A1.fillna(value=0, inplace=True)
df['new_A1'] = (1+df['A1'])
df['log_A1'] = np.log(df['new_A1'])
But I don't want to write this out for all columns (>900 of them).
How can I iterate and create new columns?
Thanks in advance!
Here's a cleaned up version of what I think you are trying to do:
# Include only variables with the "column" stub
cols = [c for c in df.columns if 'column' in c]

for i, c in enumerate(cols):
    a = f"A{i+1}"
    df[a] = 1 + df.loc[df[c] > 0.1, 'foo'].rank(ascending=False)
    df[f'log_{a}'] = np.log(df[a]).fillna(value=0)
I'm assuming that you didn't need the intermediate new_A# column and were just using it for the log calculation.
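One caveat: with 900+ columns, inserting new columns one at a time can fragment the frame (pandas may emit a PerformanceWarning). A variant sketch that collects the new columns in a dict and concatenates once, keeping the same fillna-then-log semantics as the original post:

new_cols = {}
for i, c in enumerate(cols):
    # rank 'foo' where the column exceeds 0.1, then align to the full index
    ranked = df.loc[df[c] > 0.1, 'foo'].rank(ascending=False).reindex(df.index)
    new_cols[f'log_A{i+1}'] = np.log(1 + ranked.fillna(0))

df = pd.concat([df, pd.DataFrame(new_cols)], axis=1)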
You can iterate through the column names and perform the +1 and the log operations. df.columns gives you the column headers, so you can do something like this, for example:
# skip the first column ('foo') and start numbering at 1
for index, column in enumerate(df.columns[1:], start=1):
    df['new_A' + str(index)] = 1 + df[column]
    df['log_A' + str(index)] = np.log(df['new_A' + str(index)])
You can add the rest of the operations too inside the same loop.
Hope it helps
You can just do:
import pandas as pd
import numpy as np

df = pd.read_csv('something.csv')

a = ['A' + str(i) for i in range(1, len(df.columns.values))]
b = [x for x in df.columns.values if x != 'foo']
to_create = list(zip(b, a))

for create in to_create:
    df[create[1]] = df.foo[df[create[0]] > 0.1].rank(ascending=False)
    df['new_' + create[1]] = 1 + df[create[1]]
    df['log_' + create[1]] = np.log(df['new_' + create[1]])

print(df.fillna(value=0))
which outputs:
foo column1 column2 A1 new_A1 log_A1 A2 new_A2 log_A2
0 5 0.80 0.01 5.0 6.0 1.791759 0.0 0.0 0.000000
1 10 0.90 0.01 3.0 4.0 1.386294 0.0 0.0 0.000000
2 15 0.20 1.20 2.0 3.0 1.098612 2.0 3.0 1.098612
3 8 0.12 0.50 4.0 5.0 1.609438 3.0 4.0 1.386294
4 74 0.78 0.70 1.0 2.0 0.693147 1.0 2.0 0.693147
I have two DataFrames:
First one (sp_df)
X Y density keep mass size
10 20 33 False 23 23
3 2 52 True 5 5
1.2 3 35 False 25 52
Second one (ep_df)
X Y density keep mass size
2.1 1.1 55 True 4.0 4.4
1.1 2.9 60 False 24.8 54.8
9.0 25.0 33 False 22.0 10.0
Now I need to merge them by their X/Y position into something like this:
X-SP Y-SP density-SP ........ X-EP Y-EP density-EP......
1.5 2.0 30 1.0 2.4 28.7
So with the data shown above you would get something like this:
X-SP Y-SP density-SP keep-SP mass-SP size-SP X-EP Y-EP density-EP keep-EP mass-EP size-EP
3 2 52 True 5 5 2.1 1.1 55 True 4.0 4.4
1.2 3 35 False 25 52 1.1 2.9 60 False 24.8 54.8
10 20 33 False 23 23 9.0 25.0 33 False 22.0 10.0
My problem is that the matching values are not exactly alike, so I need some kind of comparison to decide which rows in the two dataframes most likely belong together. Unfortunately, I have no idea how I can get this done.
Any tips or advice? Thanks in advance
You can merge the two dataframes as a cartesian product, which gives a dataframe with each row of the first dataframe joined with every row of the second. Then remove the rows where the X values of the two dataframes differ too much. Hope the following code helps:
import pandas as pd

# cartesian product
sp_df['key'] = 1
ep_df['key'] = 1
df = pd.merge(sp_df, ep_df, on='key', suffixes=['_sp', '_ep'])
del df['key']

# take the difference and remove rows
# with a difference of 1 or more
df['diff'] = df['X_sp'] - df['X_ep']
drop = df.index[df['diff'] >= 1].tolist()
df = df.drop(drop)
df
Edited code:
# cartesian product
sp_df['key'] = 1
ep_df['key'] = 1
df = pd.merge(sp_df, ep_df, on='key', suffixes=['_sp', '_ep'])
del df['key']

# take the difference and keep only rows
# where 0 < diff <= 1
df['diff'] = df['X_sp'] - df['X_ep']
drop = df.index[df['diff'] >= 1.01].tolist()
drop_negative = df.index[df['diff'] <= 0].tolist()
dropped_rows = drop + drop_negative
df = df.drop(dropped_rows)
df
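Since the underlying goal is to find which row of ep_df most likely corresponds to each row of sp_df, a nearest-neighbour match may be more robust than a fixed difference cutoff. A sketch using scipy's cKDTree (this assumes scipy is available and that every sp row has exactly one ep counterpart):

from scipy.spatial import cKDTree

# index the ep positions, then find the nearest ep row for every sp row
tree = cKDTree(ep_df[['X', 'Y']].to_numpy())
_, nearest = tree.query(sp_df[['X', 'Y']].to_numpy(), k=1)

merged = pd.concat(
    [sp_df.reset_index(drop=True).add_suffix('-SP'),
     ep_df.iloc[nearest].reset_index(drop=True).add_suffix('-EP')],
    axis=1)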