Selection of columns - python

I work with a Pandas DataFrame. I want to aggregate data by one column and then summarize the other columns. You can see an example below:
import pandas as pd

data = {'name': ['Company1', 'Company2', 'Company1', 'Company2', 'Company5'],
        'income': [0, 180395, 4543168, 7543168, 73],
        'turnover': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, columns=['name', 'income', 'turnover'])
df
INCOME_GROUPED = df.groupby(['name']).agg({'income': sum, 'turnover': sum})
So the code above works well and gives the correct result. Now the next step is selection: I want to select only two columns from the INCOME_GROUPED dataframe.
INCOME_SELECT = INCOME_GROUPED[['name','income']]
But after executing this line of code I get this error:
"None of [Index(['name', 'income'], dtype='object')] are in the [columns]"
Can anybody help me solve this problem?

You need to call reset_index() after agg():
INCOME_GROUPED = df.groupby(['name']).agg({'income':sum,'turnover':sum}).reset_index()
# ^^^^^^^^^^^^^^ add this
Output:
>>> INCOME_GROUPED[['name', 'income']]
name income
0 Company1 4543168
1 Company2 7723563
2 Company5 73

In this line
INCOME_SELECT = WAGE_GROUPED[['name','income']]
you probably meant INCOME_GROUPED instead of WAGE_GROUPED. Also, uppercase variable names are frowned upon unless they are global constants.
Fixing the name only solves part of the problem, though: "income" is available as a column, but by default groupby makes the grouper column ("name") the index of the result. This is changeable by passing as_index=False to it:
income_grouped = (df.groupby("name", as_index=False)
                    .agg({"income": "sum", "turnover": "sum"}))
income_selected = income_grouped[["name", "income"]].copy()
Note also the copy at the very end to avoid the infamous SettingWithCopyWarning in possible future manipulations.
Last note: since you aggregate all the columns and do it with the same method ("sum"), you can simply go for
income_grouped = df.groupby("name", as_index=False).sum()
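For completeness, a quick check of the selection against the sample data (the sums follow from the numbers in the question):

income_grouped = df.groupby("name", as_index=False).sum()
income_selected = income_grouped[["name", "income"]].copy()
print(income_selected)
#        name   income
# 0  Company1  4543168
# 1  Company2  7723563
# 2  Company5       73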


PySpark: Create a subset of a dataframe for all dates

I have a DataFrame that has a lot of columns and I need to create a subset of that DataFrame that has only date values.
For example, my DataFrame could be:
1, 'John Smith', '12/10/1982', '123 Main St', '01/01/2000'
2, 'Jane Smith', '11/21/1999', 'Abc St', '12/12/2020'
And my new DataFrame should only have:
'12/10/1982', '01/01/2000'
'11/21/1999', '12/12/2020'
The dates could be in any format and could be in any column. I can use dateutil.parser to parse them to make sure they are dates, but I'm not sure how to call parse() on all the columns and easily filter only those that return true into another dataframe.
If you know which columns the datetimes are in, it's easy:
df2 = df[["col_name_1", "col_name_2"]]
# or
df2 = df.iloc[:, [2, 4]]
You can find your columns' datatype by checking each tuple in your_dataframe.dtypes.
schema = "id int, name string, date timestamp, date2 timestamp"
df = spark.createDataFrame([(1, "John", datetime.now(), datetime.today())], schema)
list_of_columns = []
for (field_name, data_type) in df.dtypes:
if data_type == "timestamp":
list_of_columns.append(field_name)
Now you can use this list inside .select()
df_subset_only_timestamps = df.select(list_of_columns)
EDIT: I realized your date columns might be StringType.
You could try something like:
from pyspark.sql.functions import col, when

df_subset_only_timestamps = df.select([when(col(column).like("%/%/%"), col(column)).alias(column) for column in df.columns]).na.drop()
Inspired by this answer. Let me know if it works!
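If you would rather detect date-like string columns with dateutil.parser, as the question suggests, here is a rough sketch. It assumes the candidate columns are strings and that a small sample of rows is representative; note that dateutil is lenient (bare numbers like '123' also parse), so you may want extra checks. The helper and variable names are just illustrative:

from dateutil import parser

def looks_like_date(value):
    # True if dateutil can parse the value as a date.
    try:
        parser.parse(value)
        return True
    except (ValueError, OverflowError, TypeError):
        return False

sample_rows = df.limit(20).collect()   # small sample to inspect
date_like_cols = [
    c for c in df.columns
    if all(looks_like_date(str(row[c])) for row in sample_rows if row[c] is not None)
]
df_dates_only = df.select(date_like_cols)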

About Pandas Dataframe

I have a question related to Pandas.
In df1 I have a data frame with the id of each seller and their respective names.
In df2 I have the id of the salesmen and their respective sales.
I would like to add two new columns to df2 with the first and last names of the salesmen.
P.S.: in df2, one of the sales is shared between two vendors.
import pandas as pd
vendors = {'first_name': ['Montgomery', 'Dagmar', 'Reeba', 'Shalom', 'Broddy', 'Aurelia'],
'last_name': ['Humes', 'Elstow', 'Wattisham', 'Alen', 'Keningham', 'Brechin'],
'id_vendor': [127, 241, 329, 333, 212, 233]}
sales = {'id_vendor': [['127'], ['241'], ['329, 333'], ['212'], ['233']],
'sales': [1233, 25000, 8555, 4333, 3222]}
df1 = pd.DataFrame(vendors)
df2 = pd.DataFrame(sales)
I attach the code. Any suggestions?
Thank you in advance.
You can merge df1 with the exploded id_vendor column of df2, then use DataFrameGroupBy.agg when grouping by sales to obtain the columns you want:
transform_names = lambda x: ', '.join(list(x))
res = (df1.merge(df2.explode('id_vendor'))
          .groupby('sales')
          .agg({'first_name': transform_names, 'last_name': transform_names,
                'id_vendor': list}))
print(res)
first_name last_name id_vendor
sales
1233 Montgomery Humes [127]
3222 Aurelia Brechin [233]
4333 Broddy Keningham [212]
8555 Reeba, Shalom Wattisham, Alen [329, 333]
25000 Dagmar Elstow [241]
Note:
In your example, id_vendor in df2 is populated by lists of strings, but since id_vendor in df1 is of integer type, I assume that was a typo. If id_vendor really does contain lists of strings, you also need to convert the strings to integers:
transform_names = lambda x: ', '.join(list(x))
# Notice the .astype(int) call.
res = (df1.merge(df2.explode('id_vendor').astype(int))
          .groupby('sales')
          .agg({'first_name': transform_names, 'last_name': transform_names,
                'id_vendor': list}))
print(res)
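If instead you want to keep df2's one-row-per-sale shape and just attach the names (as the question asks), here is a sketch. It assumes each element of id_vendor is a single id string, e.g. ['329', '333'] rather than ['329, 333']:

exploded = df2.explode('id_vendor').reset_index()          # keep the original row id
exploded['id_vendor'] = exploded['id_vendor'].astype(int)  # match df1's integer ids
joined = exploded.merge(df1, on='id_vendor')
names = joined.groupby('index').agg({'first_name': ', '.join,
                                     'last_name': ', '.join})
df2_with_names = df2.join(names)                           # align on the original index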

Compare df's including detailed insight in data

I have a Python project with:
df_testR with columns={'Name', 'City', 'Licence', 'Amount'}
df_testF with columns={'Name', 'City', 'Licence', 'Amount'}
I want to compare both df's. The result should be a df where I see the Name, City, Licence and the Amount. Normally, df_testR and df_testF should be exactly the same.
In case they are not the same, I want to see the difference in Amount_R vs Amount_F.
I referred to: Diff between two dataframes in pandas
But I receive a table with TRUE and FALSE only:
Name   City   Licence   Amount
True   True   True      False
But I'd like to get a table that lists ONLY the lines where differences occur, and that shows the differences between the data in the way such as:
Name   City   Licence   Amount_R   Amount_F
Paul   NY     YES       200        500
Here, both tables contain Paul, NY and Licence = YES, but table R has 200 as the Amount while table F has 500. I want my analysis to return a table that captures only the lines where such differences occur.
Could someone help?
import copy
import pandas as pd
data1 = {'Name': ['A', 'B', 'C'], 'City': ['SF', 'LA', 'NY'], 'Licence': ['YES', 'NO', 'NO'], 'Amount': [100, 200, 300]}
data2 = copy.deepcopy(data1)
data2.update({'Amount': [500, 200, 300]})
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df2.drop(1, inplace=True)
First find the missing rows and print them:
matching = df1.isin(df2)
meta_data_columns = ['Name', 'City', 'Licence']
metadata_match = matching[meta_data_columns]
metadata_match['check'] = metadata_match.apply(all, 1, raw=True)
missing_rows = list(metadata_match.index[~metadata_match['check']])
if missing_rows:
    print('Some rows are missing from df2:')
    print(df1.iloc[missing_rows, :])
Then drop these rows and merge:
df3 = pd.merge(df2, df1.drop(missing_rows), on=meta_data_columns)
Now remove the rows that have the same amount:
df_different_amounts = df3.loc[df3['Amount_x'] != df3['Amount_y'], :]
I assumed the DFs are sorted.
If you're dealing with very large DFs it might be better to first filter the DFs to make the merge faster.
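To get the Amount_R / Amount_F column names from the question instead of pandas' default _x/_y, you can pass suffixes to the merge (a small tweak to the snippet above, assuming df1 plays the role of df_testR and df2 of df_testF):

df3 = pd.merge(df2, df1.drop(missing_rows), on=meta_data_columns,
               suffixes=('_F', '_R'))
df_different_amounts = df3.loc[df3['Amount_F'] != df3['Amount_R'], :]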

Change order of list of lists according to another list

I have a bunch of CSV files where the first row holds the column names, and now I want to change the column order according to another list.
Example:
[
['date','index','name','position'],
['2003-02-04','23445','Steiner, James','98886'],
['2003-02-04','23446','Holm, Derek','2233'],
...
]
The above order differs slightly between the files, but the same column-names are always available.
So I want the columns to be rearranged as:
['index','date','name','position']
I can solve it by comparing the first row, making an index for each column, then re-map each row into a new list of lists using a for-loop.
And while it works, it feels so ugly even my blind old aunt would yell at me if she saw it.
Someone on IRC told me to look at map() and operator but I'm just not experienced enough to puzzle those together. :/
Thanks.
Plain Python
You could use zip to transpose your data:
data = [
['date','index','name','position'],
['2003-02-04','23445','Steiner, James','98886'],
['2003-02-04','23446','Holm, Derek','2233']
]
columns = list(zip(*data))
print(columns)
# [('date', '2003-02-04', '2003-02-04'), ('index', '23445', '23446'), ('name', 'Steiner, James', 'Holm, Derek'), ('position', '98886', '2233')]
It becomes much easier to modify the columns order now.
To calculate the needed permutation (mapping each new position to the old position of that column), you can use:
old = data[0]
new = ['index','date','name','position']
mapping = {i: old.index(v) for i, v in enumerate(new)}
# {0: 1, 1: 0, 2: 2, 3: 3}
You can apply the permutation to the columns:
columns = [columns[mapping[i]] for i in range(len(columns))]
# [('index', '23445', '23446'), ('date', '2003-02-04', '2003-02-04'), ('name', 'Steiner, James', 'Holm, Derek'), ('position', '98886', '2233')]
and transpose them back:
list(zip(*columns))
# [('index', 'date', 'name', 'position'), ('23445', '2003-02-04', 'Steiner, James', '98886'), ('23446', '2003-02-04', 'Holm, Derek', '2233')]
With Pandas
For this kind of task, you should use pandas.
It can parse CSVs, reorder columns, sort them and keep an index.
If you have already imported the data, you can use the following to build a DataFrame from it, using the first row as the header and the 'index' column as the index:
import pandas as pd
df = pd.DataFrame(data[1:], columns=data[0]).set_index('index')
df then becomes:
date name position
index
23445 2003-02-04 Steiner, James 98886
23446 2003-02-04 Holm, Derek 2233
You can avoid those steps by importing the CSV directly with pandas.read_csv. Note that usecols=['index','date','name','position'] only selects the columns (pandas keeps the order from the file), so to enforce the new order you still need to reindex the result, as sketched below.
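A minimal sketch of that, assuming the files have a header row (file names are placeholders):

import pandas as pd

df = pd.read_csv('input.csv', usecols=['index', 'date', 'name', 'position'])
df = df[['index', 'date', 'name', 'position']]   # enforce the new order explicitly
df.to_csv('output.csv', index=False)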
Simple and stupid:
LIST = [
    ['date', 'index', 'name', 'position'],
    ['2003-02-04', '23445', 'Steiner, James', '98886'],
    ['2003-02-04', '23446', 'Holm, Derek', '2233'],
]
NEW_HEADER = ['index', 'date', 'name', 'position']

def swap(lists, new_header):
    mapping = {}
    for lst in lists:
        if not mapping:
            mapping = {
                old_pos: new_pos
                for new_pos, new_field in enumerate(new_header)
                for old_pos, old_field in enumerate(lst)
                if new_field == old_field}
        yield [item for _, item in sorted(
            [(mapping[index], item) for index, item in enumerate(lst)])]

if __name__ == '__main__':
    print(LIST)
    print(list(swap(LIST, NEW_HEADER)))
To rearrange your data, you can use a dictionary:
import csv
s = [
['date','index','name','position'],
['2003-02-04','23445','Steiner, James','98886'],
['2003-02-04','23446','Holm, Derek','2233'],
]
new_data = [{a:b for a, b in zip(s[0], i)} for i in s[1:]]
final_data = [[b[c] for c in ['index','date','name','position']] for b in new_data]
write = csv.writer(open('filename.csv', 'w', newline=''))  # open for writing
write.writerows(final_data)
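If the output file should also keep a header row, write the new column order before calling writerows:

write.writerow(['index','date','name','position'])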

What's the idiomatic way to perform an aggregate and rename operation in pandas

For example, how do you do the following R data.table operation in pandas:
PATHS[,.( completed=sum(exists), missing=sum(not(exists)), total=.N, 'size (G)'=sum(sizeMB)/1024), by=.(projectPath, pipelineId)]
I.e. group by projectPath and pipelineId, aggregate some of the columns
using possibly custom functions, and then rename the resulting columns.
Output should be a DataFrame with no hierarchical indexes, for example:
projectPath pipelineId completed missing size (G)
/data/pnl/projects/TRACTS/pnlpipe 0 2568 0 45.30824
/data/pnl/projects/TRACTS/pnlpipe 1 1299 0 62.69934
You can use groupby.agg:
df.groupby(['projectPath', 'pipelineId']).agg({
'exists': {'completed': 'sum', 'missing': lambda x: (~x).sum(), 'total': 'size'},
'sizeMB': {'size (G)': lambda x: x.sum()/1024}
})
Sample run:
df = pd.DataFrame({
'projectPath': [1,1,1,1,2,2,2,2],
'pipelineId': [1,1,2,2,1,1,2,2],
'exists': [True, False,True,True,False,False,True,False],
'sizeMB': [120032,12234,223311,3223,11223,33445,3444,23321]
})
df1 = df.groupby(['projectPath', 'pipelineId']).agg({
'exists': {'completed': 'sum', 'missing': lambda x: (~x).sum(), 'total': 'size'},
'sizeMB': {'size (G)': lambda x: x.sum()/1024}
})
df1.columns = df1.columns.droplevel(0)
df1.reset_index()
Update: if you really want to customize the aggregation without using the deprecated nested dictionary syntax, you can always use groupby.apply and return a Series object from each group:
df.groupby(['projectPath', 'pipelineId']).apply(
    lambda g: pd.Series({
        'completed': g.exists.sum(),
        'missing': (~g.exists).sum(),
        'total': g.exists.size,
        'size (G)': g.sizeMB.sum()/1024
    })
).reset_index()
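On more recent pandas (0.25+), named aggregation gives the same result without the deprecated nested dict or a separate rename; a sketch under that assumption:

out = (df.groupby(['projectPath', 'pipelineId'], as_index=False)
         .agg(completed=('exists', 'sum'),
              missing=('exists', lambda x: (~x).sum()),
              total=('exists', 'size'),
              **{'size (G)': ('sizeMB', lambda x: x.sum() / 1024)}))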
I believe the new 0.20, more "idiomatic" way, is like this (where the second layer of the nested dictionary is basically replaced by an appended .rename method):
...( completed=sum(exists), missing=sum(not(exists)), total=.N, 'size (G)'=sum(sizeMB)/1024), by=.(projectPath, pipelineId)]... in R, becomes
EDIT: use as_index=False in pd.DataFrame.groupby() to prevent a MultiIndex in final df
df.groupby(['projectPath', 'pipelineId'], as_index=False).agg({
'exists': 'sum',
'pipelineId': 'count',
'sizeMB': lambda s: s.sum() / 1024
}).rename(columns={'exists': 'completed',
'pipelineId': 'total',
'sizeMB': 'size (G)'})
And then I might just add another line for the inverse of 'exists' -> 'missing':
df['missing'] = df.total - df.completed
As an example in a Jupyter notebook test, below is a mock directory tree of 46 total pipeline paths imported by pd.read_csv() into a Pandas DataFrame. I slightly modified the example in the question to use random data in the form of DNA strings of 1,000-100,000 nucleotide bases, in lieu of creating MB-size files. Non-discrete gigabases are still calculated, using NumPy's np.mean() on the aggregated pd.Series available within the df.agg call to illustrate the process, though lambda s: s.mean() would be the simpler way to do it.
e.g.,
df_paths.groupby(['TRACT', 'pipelineId']).agg({
'mean_len(project)' : 'sum',
'len(seq)' : lambda agg_s: np.mean(agg_s.values) / 1e9
}).rename(columns={'len(seq)': 'Gb',
'mean_len(project)': 'TRACT_sum'})
where "TRACT" was a category one-level higher to "pipelineId" in the dir tree, such that in this example you can see there's 46 total unique pipelines — 2 "TRACT" layers AB/AC x 6 "pipelineId"/"project"'s x 4 binary combinations 00, 01, 10, 11 (minus 2 projects which GNU parallel made into a third topdir; see below). So in the new agg the stats transformed the mean of project-level into the sums of all respective projects agg'd per-TRACT.
import random

import pandas as pd

df_paths = pd.read_csv('./data/paths.txt', header=None, names=['projectPath'])
# df_paths['projectPath'] =
df_paths['pipelineId'] = df_paths.projectPath.apply(
    lambda s: ''.join(s.split('/')[1:5])[:-3])
df_paths['TRACT'] = df_paths.pipelineId.apply(lambda s: s[:2])
df_paths['rand_DNA'] = [
    ''.join(random.choices(['A', 'C', 'T', 'G'],
                           k=random.randint(1_000, 100_000)))
    for _ in range(df_paths.shape[0])
]
df_paths['len(seq)'] = df_paths.rand_DNA.apply(len)
df_paths['mean_len(project)'] = df_paths.pipelineId.apply(
    lambda pjct: df_paths.groupby('pipelineId')['len(seq)'].mean()[pjct])
df_paths
