I'm creating a function to filter many dataframes using groupby. The dataframes look like the one below; however, they do not all contain the same number of columns.
df = pd.DataFrame({
    'xyz CODE': [1, 2, 3, 3, 4, 5, 6, 7, 7, 8],
    'a': [4, 5, 3, 1, 2, 20, 10, 40, 50, 30],
    'b': [20, 10, 40, 50, 30, 4, 5, 3, 1, 2],
    'c': [25, 20, 5, 15, 10, 25, 20, 5, 15, 10]})
For each dataframe I always apply groupby to the first column, which is named differently across dataframes. All other columns are named consistently across all dataframes.
My question: Is it possible to run groupby using a combination of column location and column names? How can I do it?
I wrote the following function and got the error TypeError: unhashable type: 'list':
def filter_all_df(df):
    df['max_c'] = df.groupby(df.columns[0])['a'].transform('max')
    newdf = df[df['a'] == df['max_c']].drop(['max_c'], axis=1)
    newdf['max_score'] = newdf.groupby([newdf.columns[0], 'a', 'b'])['c'].transform('max')
    newdf = newdf[newdf['c'] == newdf['max_score']]
    newdf = newdf.sort_values([newdf.columns[0]]).drop_duplicates([newdf.columns[0], 'a', 'b', 'c'], keep='last')
    newdf.to_csv('newdf_all.csv')
    return newdf
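For reference, a minimal sketch (using the df above) of mixing a positional lookup with fixed column names: df.columns[0] is itself just a label, so groupby accepts it in the same list as the named columns.
# The first column is picked up by position ('xyz CODE' here); 'a' and 'b' by name.
group_cols = [df.columns[0], 'a', 'b']
max_per_group = df.groupby(group_cols)['c'].transform('max')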
I have a fairly large dataframe whose columns I am trying to combine in a very specific manner. The original dataframe has 2150 columns and the final dataframe should have around 500, each new column being the average of a spread of columns. The spread changes, which is why I have used a list holding the start of each column group.
My actual code gets the desired results, but it emits the warning:
"PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
df1[str(val)] = df[combine].mean(axis=1)"
I cannot think of a smart way to use concat for one single combine at the end whilst still taking the mean of each group. I am also new to writing code, so any corrections to my style would be appreciated, especially where I have to break out of the loop.
Here is my actual code.
import pandas as pd
df = pd.read_csv("some file location")
new_cols = list(range(350, 702, 3)) + list(range(707, 1398, 6)) + \
    list(range(1407, 2098, 10)) + list(range(2112, 2488, 15)) + [2501]
cols = list(map(int, list(df.columns)[1:]))
df1 = df.copy()
for i, val in enumerate(new_cols):
    if val == 2501:
        break
    combine = list(map(str, range(new_cols[i], new_cols[i+1])))
    print(combine)
    df1 = df1.drop(combine, axis=1, inplace=False)
    df1[str(val)] = df[combine].mean(axis=1)
df1.to_csv("data_reduced_precision.csv", index=False)
print("Finished")
Here is a minimal example which shows what I am trying to achieve. It doesn't produce the PerformanceWarning because it has only a few columns, but I hope it illustrates my method.
df1 = pd.DataFrame({'1': [1, 2, 3, 4],
                    '2': [5, 6, 7, 8],
                    '3': [9, 10, 11, 12],
                    '4': [13, 14, 15, 16],
                    '5': [17, 18, 19, 20],
                    '6': [21, 22, 23, 24],
                    '7': [25, 26, 27, 28]})
df2 = df1.copy()
# df2 should have columns 1,2,5 which are the mean of df1 columns [1],[2,3,4],[5,6,7]
new_cols = [1, 2, 5, 8]
for i, val in enumerate(new_cols):
    if val == 8:
        break
    # All the column names are integers as str
    combine = list(map(str, range(new_cols[i], new_cols[i+1])))
    df2 = df2.drop(combine, axis=1, inplace=False)
    df2[str(val)] = df1[combine].mean(axis=1)
print(df2)
     1     2     5
0  1.0   9.0  21.0
1  2.0  10.0  22.0
2  3.0  11.0  23.0
3  4.0  12.0  24.0
I would move your dataframe operations out of your for-loop.
import pandas
df1 = pandas.DataFrame({
    '1': [1, 2, 3, 4],
    '2': [5, 6, 7, 8],
    '3': [9, 10, 11, 12],
    '4': [13, 14, 15, 16],
    '5': [17, 18, 19, 20],
    '6': [21, 22, 23, 24],
    '7': [25, 26, 27, 28],
})
# df2 should have columns 1,2,5 which are the mean of df1 columns [1],[2,3,4],[5,6,7]
new_cols = [1, 2, 5, 8]
combos = []
for i, val in enumerate(new_cols):
    if val != 8:
        # All the column names are integers as str
        combos.append(list(map(str, range(new_cols[i], new_cols[i+1]))))
df2 = df1.assign(**{
    str(maincol): df1.loc[:, combo].mean(axis="columns")
    for maincol, combo in zip(new_cols, combos)
}).loc[:, map(str, new_cols[:-1])]
Unless I'm mistaken, this will pass around references to the original df1 instead of making a bunch of copies (i.e., df2 = df2.drop(...)).
Printing out df2, I get:
     1     2     5
0  1.0   9.0  21.0
1  2.0  10.0  22.0
2  3.0  11.0  23.0
3  4.0  12.0  24.0
If I scale this up to a 500,000 x 20 dataframe, it completes seemingly instantly without warning on my machine:
import numpy
dfbig = pandas.DataFrame(
    data=numpy.random.normal(size=(500_000, 20)),
    columns=list(map(str, range(1, 21)))
)
new_cols = [1, 2, 5, 8, 12, 13, 16, 17, 19]
combos = []
for i, val in enumerate(new_cols[:-1]):
    combos.append(list(map(str, range(new_cols[i], new_cols[i+1]))))
dfbig2 = dfbig.assign(**{
    str(maincol): dfbig.loc[:, combo].mean(axis="columns")
    for maincol, combo in zip(new_cols, combos)
}).loc[:, map(str, new_cols[:-1])]
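For what it's worth, here is a sketch of the pd.concat(axis=1) route that the warning itself suggests, using the same new_cols and string column names as above (the intermediate name means is just for illustration):
# Build each averaged column as a named Series, then join them all in a single concat.
means = [
    dfbig.loc[:, list(map(str, range(start, stop)))].mean(axis="columns").rename(str(start))
    for start, stop in zip(new_cols[:-1], new_cols[1:])
]
dfbig3 = pandas.concat(means, axis=1)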
[display(df) and display(prices): the two input dataframes were shown in the original question but are not reproduced here.]
I have two dataframes. I want to replace the month numbers in dataframe 1 with the DA HB West value for that month; the match also has to be on the same cheat code as the df.
I feel like this is really easy to do but I keep getting an error.
The error reads "ValueError: Can only compare identically-labeled Series objects"
With a sample of your data:
df2 = pd.DataFrame({"Month": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
"DA HB West": np.random.random(12),
"Year": [2019]*12,
"Cheat": ["2019PeakWE"]*12})
df = pd.DataFrame({"Month1": [7, 7, 7, 9, 11],
"Month2": [8, 8, 8, 10, 12],
"Month3": [9.0, 9.0, 9.0, 11.0, np.nan],
"Cheat4": ["2019PeakWE"]*5})
df.columns = df.columns.str[:-1]
Fill the NaN values so that there isn't an error when changing the value types to integers:
df.fillna(0, inplace=True)
Map all but the last column:
d = {}
for i, j in df.groupby("Cheat"):
    mapping = df2[df2["Cheat"] == i].set_index("Month")["DA HB West"].to_dict()
    d[i] = j
    d[i].iloc[:, :-1] = j.iloc[:, :-1].astype(int).apply(lambda x: x.map(mapping))
This creates a dictionary of all the different Cheats.
You can then append them all together, if you need to.
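For example, a minimal sketch of combining the per-Cheat frames back into one (assuming the dict d built above):
# Stack the pieces and restore the original row order.
combined = pd.concat(list(d.values())).sort_index()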
I am pretty new to Python and hence I need your help on the following:
I have two tables (dataframes):
Table 1 has all the data and it looks like this:
The GenDate column has the generation day.
The Date column has dates.
Columns D onwards have different values.
I also have the following table:
Column I has "keywords" that can be found in the header of Table 1
Column K has dates that should be in column C of table 1
My goal is to produce a table like the following:
I have omitted a few columns for illustration purposes.
Every column in Table 1 should be split based on the Type written in the header.
E.g. A_Weeks: the Weeks type corresponds to 3 splits: Week1, Week2 and Week3.
Each one of these splits has a specific Date.
In the new table, 3 columns should be created, using A_ and then the split name:
A_Week1, A_Week2 and A_Week3.
For each one of these columns, the value that corresponds to the Date of each split should be used.
I hope the explanation is good.
Thanks
You can get the desired table with the following code (follow the comments and check the pandas API reference to learn about the functions used):
import numpy as np
import pandas as pd
# initial data
t_1 = pd.DataFrame(
    {'GenDate': [1, 1, 1, 2, 2, 2],
     'Date': [10, 20, 30, 10, 20, 30],
     'A_Days': [11, 12, 13, 14, 15, 16],
     'B_Days': [21, 22, 23, 24, 25, 26],
     'A_Weeks': [110, 120, 130, 140, np.nan, 160],
     'B_Weeks': [210, 220, 230, 240, np.nan, 260]})
# initial data
t_2 = pd.DataFrame(
    {'Type': ['Days', 'Days', 'Days', 'Weeks', 'Weeks'],
     'Split': ['Day1', 'Day2', 'Day3', 'Week1', 'Week2'],
     'Date': [10, 20, 30, 10, 30]})
# create multiindex
t_1 = t_1.set_index(['GenDate', 'Date'])
# pivot 'Date' level of MultiIndex - unstack it from index to columns
# and drop columns with all NaN values
tt_1 = t_1.unstack().dropna(axis=1)
# tt_1 is what you need with multi-level column labels
# map to rename columns
t_2 = t_2.set_index(['Type'])
mapping = {
    type_: dict(zip(
        t_2.loc[type_, :].loc[:, 'Date'],
        t_2.loc[type_, :].loc[:, 'Split']))
    for type_ in t_2.index.unique()}
# new column names
new_columns = list()
for letter_type, date in tt_1.columns.values:
    letter, type_ = letter_type.split('_')
    new_columns.append('{}_{}'.format(letter, mapping[type_][date]))
tt_1.columns = new_columns
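As a quick sanity check with the sample data above, the renamed frame should have one row per GenDate and columns built from the letter plus the split name:
# Inspect the final column labels and the reshaped values.
print(tt_1.columns.tolist())
# expected something like ['A_Day1', 'A_Day2', 'A_Day3', 'B_Day1', 'B_Day2', 'B_Day3',
#                          'A_Week1', 'A_Week2', 'B_Week1', 'B_Week2']
print(tt_1)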
I have two Pandas DataFrames of different sizes (at least 500,000 rows in both of them). For simplicity, you can call them df1 and df2. I'm interested in finding the rows of df1 which are not present in df2. Neither data frame is necessarily a subset of the other, and the order of the rows does not matter.
For example, the i-th observation in df1 may be the j-th observation in df2, and I need to consider it as being present (the order won't matter). Another important thing is that both data frames may contain null values (so the operation has to work for those too).
A simple example of both data frame would be
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 100], 'col2' : [10, 11, NaN, 50]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 100], 'col2' : [20, 21, NaN, 13, 14, 50]})
in this case the solution would be
df3 = pandas.DataFrame(data = {'col1' : [1, 2 ], 'col2' : [10, 11]})
Please note that in reality both data frames have 15 columns (exactly the same column names and exactly the same data types). Also, I'm using Python 2.7 in a Jupyter Notebook on Windows 7. I have used the Pandas built-in function df1.isin(df2), but it does not give the results I want.
Moreover, I have also seen this question, but it assumes that one data frame is a subset of the other, which is not necessarily true in my case.
Here's one way:
import pandas as pd, numpy as np
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 100], 'col2' : [10, 11, np.nan, 50]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 100], 'col2' : [20, 21, np.nan, 13, 14, 50]})
x = set(map(tuple, df1.fillna(-1).values)) - set(map(tuple, df2.fillna(-1).values))
# {(1.0, 10.0), (2.0, 11.0)}
pd.DataFrame(list(x), columns=['col1', 'col2'])
If you have np.nan data in your result, it'll come through as -1, but you can easily convert back. Assumes you won't have negative numbers in your underlying data [if so, replace by some impossible value].
The reason for the complication is that np.nan == np.nan is considered False.
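A quick illustration of why the fillna(-1) step is needed before building the sets:
print(np.nan == np.nan)   # False: NaN never compares equal, so rows containing NaN would never match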
Here is one solution:
pd.concat([df1,df2.loc[df2.col1.isin(df1.col1)]],keys=[1,2]).drop_duplicates(keep=False).loc[1]
Out[892]:
   col1  col2
0     1  10.0
1     2  11.0
I would like to drop all data in a pandas dataframe, but am getting TypeError: drop() takes at least 2 arguments (3 given). I essentially want a blank dataframe with just my columns headers.
import pandas as pd
web_stats = {'Day': [1, 2, 3, 4, 2, 6],
             'Visitors': [43, 43, 34, 23, 43, 23],
             'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)
df.drop(axis=0, inplace=True)
print df
You need to pass the labels to be dropped.
df.drop(df.index, inplace=True)
By default, it operates on axis=0.
You can achieve the same with
df.iloc[0:0]
which is much more efficient.
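For instance, a quick check on the question's df (only the headers survive):
empty = df.iloc[0:0]
print(empty.shape)           # (0, 3): zero rows, the three column headers kept
print(list(empty.columns))   # ['Day', 'Visitors', 'Bounce_Rate'] (order may vary by pandas/Python version)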
My favorite:
df = df.iloc[0:0]
But be aware that df.index.max() will then be NaN.
To add items I use (this needs import math):
df.loc[0 if math.isnan(df.index.max()) else df.index.max() + 1] = data
My favorite way is:
df = df[0:0]
Overwrite the dataframe with something like this:
import pandas as pd
df = pd.DataFrame(None)
or if you want to keep columns in place
df = pd.DataFrame(columns=df.columns)
If your goal is to drop all of the data, then you need to pass all the columns. For me, the best way is to pass a list comprehension to the columns kwarg. This then works regardless of the different columns in a df.
import pandas as pd
web_stats = {'Day': [1, 2, 3, 4, 2, 6],
             'Visitors': [43, 43, 34, 23, 43, 23],
             'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)
df = df.drop(columns=[i for i in df.columns])
This code makes a clean dataframe:
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
#clean
df = pd.DataFrame()