creating DataFrame from a Data Series - python

I have a data series, df:

            primary
Buy             484
Sell            429
Blanks          130
FX Spot         108
Income           77
FX Forward        2
I am trying to create a dataframe with two columns: the first column's values should be the index of df, and the second column should hold the values of primary from df.
By using

filter_df = pd.DataFrame({'contents': df.index, 'values': df.values})

I get

Exception: Data must be 1-dimensional

Use reset_index with rename_axis to set the new column name:
filter_df = df.rename_axis('content').reset_index()
Another solution with rename:
filter_df = df.reset_index().rename(columns={'index':'content'})
For the DataFrame constructor, you need df['primary'] to select the column:
filter_df=pd.DataFrame({'contents':df.index, 'values':df['primary'].values})
print (filter_df)

      content  primary
0         Buy      484
1        Sell      429
2      Blanks      130
3     FX Spot      108
4      Income       77
5  FX Forward        2
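
The exception itself comes from df being a one-column DataFrame rather than a Series, so df.values is two-dimensional and the constructor rejects it. A minimal sketch of the difference, assuming df is built from the data above:

import pandas as pd

df = pd.DataFrame(
    {'primary': [484, 429, 130, 108, 77, 2]},
    index=['Buy', 'Sell', 'Blanks', 'FX Spot', 'Income', 'FX Forward'])

print(df.values.shape)             # (6, 1) -> 2-D, raises "Data must be 1-dimensional"
print(df['primary'].values.shape)  # (6,)   -> 1-D, accepted by the constructor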

Missing rows when merging pandas dataframes?

I am trying to merge the following two dataframes on the date, price, and code columns, but there is an unexpected loss of rows:
df1.head(5)

         date  price  code
0  2015-01-03  27.65   534
1  2015-03-30  43.65   231
2  2015-01-09  24.65   132
3  2015-11-13  87.13   211
4  2015-05-12  63.25   231

df2.head(5)

         date  price  code  volume
0  2015-01-03  27.65   534      43
1  2015-03-30  43.65   231      21
2  2015-01-09  24.65   132      88
3  2015-12-25   0.00   211     130
4  2015-03-12  11.15   211      99
df1 is a subset of df2, but without the volume column.
I tried the following code, but half of the rows disappear:
master = pd.merge(df1, df2, on=['date', 'price', 'code'])
I know I can do a left join and keep all the rows, but they will be NaN for volume, which is the entire purpose of doing the join.
I'm not sure why there are missing rows after the join, as df2 contains every single date, with the same prices and codes as df1. They also have the same data types; the dates are strings.
df1.dtypes
date      object
price    float64
code         int
dtype: object

df2.dtypes
date       object
price     float64
code          int
volume        int
dtype: object
When I check if a certain value exists in a dataframe, it returns False, which could be a clue as to why they don't join as expected (unless I'm checking wrong):
df2['price'][0]
> 27.65
27.65 in df2['price']
> False
I'm at a loss about what to check next - any ideas? I think there could be some sort of encoding issues or something along those lines.
I also manually checked the dataframes for the rows which weren't able to be joined and the values definitely exist.
Edit: I know it's expected for the rows to be dropped when they do not match on the columns specified in the join. The problem is that rows are dropped even when they do match (or they appear to be).
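
A side note on the membership check above: the expression 27.65 in df2['price'] tests the Series index, not its values, so False is expected there regardless of the merge problem. A quick sketch of value-based checks:

27.65 in df2['price']           # False: membership looks at the index labels
27.65 in df2['price'].values    # True when the value is present in the column
df2['price'].eq(27.65).any()    # equivalent pandas-style check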
You say "df1 is a subset of df2, but without the volume column". I suppose you don't mean that their only difference is the "volume" column, which is obviously not the only difference, otherwise the question would be trivial (of just adding the "volume" column).
You just need to add the how='outer' keyword argument.
df = pd.merge(df1, df2, on=['date', 'code', 'price'], how='outer')
With your dataframes, since the column naming is consistent, you can do the following:
master = pd.merge(df1, df2, how='left')
By default, pd.merge() joins on all common columns and performs an inner join; use the how argument to request a left join instead.
The rows disappear because rows 3 and 4 have different date and price values in the two frames (and row 4 also differs in the code column). merge's default behavior is an inner join, so non-matching rows are dropped.
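
To see exactly which rows fail to match before choosing a join type, one option (a sketch using the frames above) is an outer merge with indicator=True, which labels every row as 'both', 'left_only', or 'right_only':

merged = pd.merge(df1, df2, on=['date', 'price', 'code'],
                  how='outer', indicator=True)

# rows present in only one of the two frames
print(merged[merged['_merge'] != 'both'])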

pandas average across dynamic number of columns

I have a dataframe like as shown below
customer_id  revenue_m7  revenue_m8  revenue_m9  revenue_m10
          1        1234        1231        1256         1239
          2        5678        3425        3255         2345
I would like to do the below
a) get average of revenue for each customer based on latest two columns (revenue_m9 and revenue_m10)
b) get average of revenue for each customer based on latest four columns (revenue_m7, revenue_m8, revenue_m9 and revenue_m10)
So, I tried the below
df['revenue_mean_2m'] = (df['revenue_m10']+df['revenue_m9'])/2
df['revenue_mean_4m'] = (df['revenue_m10']+df['revenue_m9']+df['revenue_m8']+df['revenue_m7'])/4
df['revenue_mean_4m'] = df.mean(axis=1)  # I also tried this, but how do I restrict it to only two columns instead of all of them?
But if I wish to compute the average for the past 12 months, it is not elegant to write it this way. Is there a better or more efficient way, where I just key in the number of columns to look back and the average is computed from that input?
I expect my output to be like the below:

customer_id  revenue_m7  revenue_m8  revenue_m9  revenue_m10  revenue_mean_2m  revenue_mean_4m
          1        1234        1231        1256         1239             1867          1240
          2        5678        3425        3255         2345             2800          3675.75
Use filter and slicing:
# keep only the "revenue_" columns
df2 = df.filter(like='revenue_')
# or
# df2 = df.filter(regex=r'revenue_m\d+')
# get last 2/4 columns and aggregate as mean
df['revenue_mean_2m'] = df2.iloc[:, -2:].mean(axis=1)
df['revenue_mean_4m'] = df2.iloc[:, -4:].mean(axis=1)
Output:

   customer_id  revenue_m7  revenue_m8  revenue_m9  revenue_m10  \
0            1        1234        1231        1256         1239
1            2        5678        3425        3255         2345

   revenue_mean_2m  revenue_mean_4m
0           1247.5          1240.00
1           2800.0          3675.75
If the column order is not guaranteed, sort the columns with natural sorting:
# shuffle the DataFrame columns for demo
df = df.sample(frac=1, axis=1)
# filter and reorder the needed columns
from natsort import natsort_key
df2 = df.filter(regex=r'revenue_m\d+').sort_index(key=natsort_key, axis=1)
You could try something like this, in reference to this post:
n_months = 4  # you could also do this in a loop for all months: range(1, 12)
df[f'revenue_mean_{n_months}m'] = df.iloc[:, -n_months:].mean(axis=1)
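
If you want this fully parameterized, a small helper along these lines (a sketch built on the filter/iloc idea above; the function name is just illustrative, and it assumes the month columns are already in chronological order) lets you key in any look-back window:

def add_revenue_mean(df, n_months):
    # keep only the revenue_m* columns, take the last n, and average per row
    revenue = df.filter(regex=r'revenue_m\d+')
    df[f'revenue_mean_{n_months}m'] = revenue.iloc[:, -n_months:].mean(axis=1)
    return df

df = add_revenue_mean(df, 2)   # adds revenue_mean_2m
df = add_revenue_mean(df, 4)   # adds revenue_mean_4m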

DataFrame Pivot Table run specific function on specific columns

I have a dataset that includes the information below. I'd like to write a pivot table that counts the number of days from the Date column and then sums the Impressions, Clicks, Conversions, and Budget Delivered columns. Essentially, I'd like a summary of the table.
       Date  Impressions  Clicks  Conversions  Budget Delivered
0  1/1/2019   11,506,995   1,672           88        $12,124.14
1  1/2/2019    9,394,458   1,516          179         $9,838.45
2  1/3/2019    4,696,388     878          129         $6,858.67
3  1/4/2019    8,987,784   1,179          107         $9,566.55
4  1/5/2019    8,923,751   1,171           88            $9,322
I am having trouble figuring out how to return this single-row DataFrame. I am trying to use the pivot_table method, but the groupby parameter is not returning the desired result. I'm not sure how to approach this issue.
from datatable import dt, f, by
df = dt.Frame("""
Date Impressions Clicks Conversions Budget Delivered
1/1/2019 11,506,995 1,672 88 $12,124.14
1/2/2019 9,394,458 1,516 179 $9,838.45
1/3/2019 4,696,388 878 129 $6,858.67
1/4/2019 8,987,784 1,179 107 $9,566.55
1/5/2019 8,923,751 1,171 88 $9,322
""")
budget = df['Budget'].to_list()[0]
budget = [float(x.replace('$', '').replace(',', '')) for x in budget]
df['Budget'] = dt.Frame(budget)
df[:, dt.sum(f[1:6])]
   | Impressions  Clicks  Conversions   Budget  Delivered
-- + -----------  ------  -----------  -------  ---------
 0 |    43509376    6416          591  47709.8          0
The main problem is string cleaning. As it stands, your input DataFrame contains mostly strings because of non-numeric characters such as "/", "," and "$". The first step is to clean the data and convert it to a summable type such as int or float. Then we can sum all rows.
For non-numeric fields that should be counted rather than summed ('Date'), we replace those summed strings with counts.
Also, not sure you need a single row DataFrame when a Series would have sufficed, but since it was in the requirements I did that too.
It's inelegant, but it works:
def clean(x):
    if isinstance(x, str):
        return x.replace('$', '').replace(',', '')
    return x

data_df['Budget Delivered'] = data_df['Budget Delivered'].apply(clean).astype('float')

col_names_to_intify = ['Impressions', 'Clicks', 'Conversions']
for col in col_names_to_intify:
    data_df[col] = data_df[col].apply(clean).astype('int')

sum_df = data_df.sum().to_frame().T
for col in data_df.columns:
    if data_df[col].dtypes.str == '|O':
        sum_df[col] = data_df[col].count()
which gives sum_df as

   Date  Impressions  Clicks  Conversions  Budget_Delivered
0     5     43509376    6416          591          47709.81
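
A more compact variant of the same idea (a sketch, assuming the columns are strings exactly as shown above): strip the "$" and "," characters with a regex replace, convert to numbers, sum, and attach the day count separately.

num_cols = ['Impressions', 'Clicks', 'Conversions', 'Budget Delivered']

cleaned = (data_df[num_cols]
           .replace(r'[$,]', '', regex=True)   # drop currency symbols and thousands separators
           .apply(pd.to_numeric))

summary = cleaned.sum().to_frame().T
summary.insert(0, 'Days', data_df['Date'].nunique())   # count of distinct dates rather than a sum
print(summary)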

Can I "flatten" a column with empty cells?

For example, I have a dataframe below with multiple columns and rows in which the last column only has data for some of the rows. How can I take that last column and write it to a new dataframe while removing the empty cells that would remain if I just copied the entire column?
Part Number  Count  Miles
    2345125     14    543
    5432545     12
    6543654      6    112
    6754356     22
    5643545      6
    7657656      8     23
    7654567     11    231
    3455434     34    112
The data frame I want to obtain is shown below:
Miles
543
112
23
231
112
I've tried converting the empty cells to NaN and then removing, but I always either get a key error or fail to remove the rows I want. Thanks for any help.
# copy the column
series = df['Miles']
# drop nan values
series = series.dropna()
# one-liner
series = df['Miles'].dropna()
Do you mean:
df.loc[df.Miles.notna(), 'Miles']
Or if you want to drop the rows:
df = df[df.Miles.notna()]
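
If you specifically want the result as a new single-column dataframe with a fresh index (rather than a Series that keeps the original row labels), a short sketch building on the dropna approach above:

miles_df = df['Miles'].dropna().reset_index(drop=True).to_frame()
print(miles_df)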

Iterating through DataFrame columns and deleting a row if a cell value is not a number

I have the following csv file that I converted to a DataFrame:
apartment,floor,gasbill,internetbill,powerbill
401,4,120,nan,340
409,4,190,50,140
410,4,155,45,180
I want to be able to iterate over each column, and if the value of a cell in the internetbill column is not a number, delete that whole row. So in this example, the "401,4,120,nan,340" row would be eliminated from the DataFrame.
I thought that something like this would work, but to no avail, and I'm stuck:
df.drop[df['internetbill'] == "nan"]
If you are using pd.read_csv, then that nan will be imported as np.nan. If so, you need dropna:
df.dropna(subset=['internetbill'])
   apartment  floor  gasbill  internetbill  powerbill
1        409      4      190          50.0        140
2        410      4      155          45.0        180
If those are strings for whatever reason, you could do one of two things:
replace
df.replace({'internetbill': {'nan': np.nan}}).dropna(subset=['internetbill'])
to_numeric
df.assign(
    internetbill=pd.to_numeric(df['internetbill'], errors='coerce')
).dropna(subset=['internetbill'])
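
If you literally want to apply the "not a number" rule across every column, as the title suggests, a sketch along the same to_numeric lines (assuming all columns are meant to be numeric):

# coerce every column to numeric; anything non-numeric becomes NaN
numeric = df.apply(pd.to_numeric, errors='coerce')

# keep only the rows where every column holds a real number
df = df[numeric.notna().all(axis=1)]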
