Adding values from two different rows into one using PySpark - Python

I have two rows with mostly the same data, but with some columns changing between those two rows:
id   product   class    cost
1    table     large    5.12
1    table     medium   2.20
so I'm trying to get the following:
id   product   class           cost
1    table     large, medium   7.32
I'm currently using the following code to get this:
df.groupBy("id", "product").agg(
    F.collect_list("class"),
    F.sum("cost").alias("Sum")
)
The issue with this snippet is that when grouping it only seems to keep the first value it finds in class, and the addition doesn't look correct (I'm not sure if it is taking that first value and adding it once for every row it encounters with the same id), so I'm getting something like this:
id   product   class          cost
1    table     large, large   10.24
This is another snippet I used so I could keep all my other fields while performing the addition across those rows:
df.withColumn("total", F.sum("cost").over(Window.partitionBy("id")))
Would it be the same to apply the F.array_join() function over that window?
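Something like the sketch below is what I have in mind (untested, assuming the same Window import), collapsing the duplicated rows afterwards:
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("id", "product")

df_win = (df
          .withColumn("class", F.array_join(F.collect_list("class").over(w), ", "))
          .withColumn("cost", F.sum("cost").over(w))
          .dropDuplicates(["id", "product"]))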

You need to use the array_join function to join the results of collect_list with commas (,).
df = df.groupBy('id', 'product').agg(
    F.array_join(F.collect_list('class'), ',').alias('class'),
    F.sum('cost').alias('cost')
)
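For completeness, here is a minimal runnable sketch of this approach on the sample data from the question (the SparkSession setup and DataFrame construction are assumed):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "table", "large", 5.12), (1, "table", "medium", 2.20)],
    ["id", "product", "class", "cost"],
)

result = df.groupBy("id", "product").agg(
    F.array_join(F.collect_list("class"), ", ").alias("class"),
    F.sum("cost").alias("cost"),
)
result.show()
# one row per (id, product): class = "large, medium", cost ≈ 7.32
Note that collect_list does not guarantee element order, so if the order of the joined classes matters you may want to sort the array (for example with array_sort) before joining.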

Related

How do I create custom calculations on Groupby objects in pandas with a new row below each object

Thank you for taking the time to read through my question. I hope you can help.
I have a large DataFrame with loads of columns. One column is an ID with multiple classes on which I would like to calculate totals and other custom calculations based on the columns above it.
The DataFrame columns look something like this:
I would like to calculate the Total AREA for each ID for all the CLASSES. Then I need to calculate the custom totals for the VAR columns using the variables from the other columns. In the end I would like to have a series of grouped IDs that look like this:
I hope that this makes sense. The current approach I have tried is the following code:
df = pd.read_csv("data.csv")
df.groupby('ID').apply(lambda x: x['AREA'].sum())
This provides me with a list of all the summed areas, which I can store in a variable to append back to the original dataframe through the ID and CLASS column. However, I am unsure how I get the other calculations done, as shown above. On top of that, I am not sure how to get the final DataFrame to mimic the above table format.
I am just starting to understand Pandas and constantly having to teach myself and ask for help when it gets rough.
Some guidance would be greatly appreciated. I am open to providing more information and clarity on the problem if this question is insufficient. Thank you.
I am not sure if I understand your formulas correctly.
First you can simplify your formula by using the built-in sum() function:
df = pd.DataFrame({'ID': [1.1, 1.1, 1.2, 1.2, 1.2, 1.3, 1.3],
                   'Class': [1, 2, 1, 2, 3, 1, 2],
                   'AREA': [350, 200, 15, 5000, 65, 280, 70],
                   'VAR1': [24, 35, 47, 12, 26, 12, 78],
                   'VAR2': [1.5, 1.2, 1.1, 1.4, 2.3, 4.5, 0.8],
                   'VAR3': [200, 300, 400, 500, 600, 700, 800]})
df.groupby(['ID']).sum()['AREA']
This will give the list you mentioned:
ID
1.1 550
1.2 5080
1.3 350
Name: AREA, dtype: int64
For the AREA per Class (e.g. Class 1) you just have to add another key to the groupby() command:
df.groupby(['ID', 'Class']).sum()['AREA']
Resulting in:
ID Class
1.1 1 350
2 200
1.2 1 15
2 5000
3 65
1.3 1 280
2 70
Name: AREA, dtype: int64
Since you want to sum up the square of the sum over each Class, we can combine both approaches:
df.groupby(['ID', 'Class']).apply(lambda x: x['AREA'].sum()**2).groupby('ID').sum()
With the result
ID
1.1 162500
1.2 25004450
1.3 83300
dtype: int64
I recommend taking the command apart and trying to understand each step. If you need further assistance, just ask.
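For example, one rough way to break it into named steps (using the same df as above):
# 1. total AREA per (ID, Class)
area_per_class = df.groupby(['ID', 'Class']).apply(lambda x: x['AREA'].sum())

# 2. square each per-class total
squared = area_per_class ** 2

# 3. sum the squares within each ID
result = squared.groupby('ID').sum()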

I need help concatenating 1 csv file and 1 pandas dataframe together without duplicates

My code currently looks like this:
df1 = pd.DataFrame(statsTableList)
df2 = pd.read_csv('StatTracker.csv')
result = pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
I get an error and I'm not sure why.
The goal of my program is to pull data from an API and then write it all to a file for analysing. df1 is, let's say, the first 100 games written to the csv file as the first version. df2 is me reading those first 100 games back the second time around and comparing them to df1 (the new data, the next 100 games) to check for duplicates and delete them.
The part that is not working is the drop_duplicates part. It gives me an "unhashable list" error; I assume that's because the two dataframes are built from lists of dictionaries. The goal is to pull 100 games of data, then pull the next 50, but if I pull game 100 again, drop that one and only add 101-150, then add it all to my csv file. Then if I run it again, pull 150-200, but drop 150 if it's a duplicate, and so on.
Based on your explanation, you can use this one-liner to find the rows of df1 that are not in df2:
df_diff = df1[~df1.apply(tuple, 1).isin(df2.apply(tuple, 1))]
This code checks whether each row exists in the other dataframe. To do the comparison it converts each row to a tuple (applying the tuple conversion along axis 1, i.e. per row).
This solution is admittedly slow because it compares each row in df1 to every row in df2, so it has O(n^2) time complexity.
If you want a more optimised version, try the pandas built-in compare method:
df1.compare(df2)
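For reference, a minimal sketch of how the row-comparison approach could slot into the original workflow (df1 and df2 as in the question; note that compare() requires the two frames to have identical shape and row labels, so it may not apply directly here):
import pandas as pd

# keep only the rows of df1 (the new pull) that are not already in df2 (what is on disk)
new_rows = df1[~df1.apply(tuple, 1).isin(df2.apply(tuple, 1))]

# append just the new rows and write everything back out
result = pd.concat([df2, new_rows]).reset_index(drop=True)
result.to_csv('StatTracker.csv', index=False)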

I am trying to apply a defined function to a grouped pandas dataframe and output the results to a csv

I have a defined function that requires a list, and the function outputs one value for every item in the list. I need to group by industry code (SIC), and apply the function within industry (so only industry 1 firms are grouped together for the defined calculation).
Example (SIC and value columns):
SIC   value
1     50
1     40
2     100
2     110
I use the following code:
dr = pd.read_csv("sample.csv", usecols=columns)
d1 = dr.groupby('SIC')['value'].apply(list)
for groups in d1:
    a = my_function(groups)
    b = pd.DataFrame(a)
    b.to_csv('output.csv', index=False)
I expected to get an output file with the function values for all 4 rows (let's say I want the difference between each row and the group average: row 1 should be 50 - avg(50, 40), which equals 5).
Instead, I get a csv file with only the last group's values. It seems like I should make a new csv file for each group, but after doing the apply(list) I can't figure out how to identify each group.
Edit: I modified the functionality as described in the comment below to output only one file.
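For reference, a minimal sketch of one way to keep the group labels and write a single file (my_function is the function from the question; the 'result' column name is an assumption):
import pandas as pd

# d1 is a Series indexed by SIC; each entry is the list of values for that group
pieces = []
for sic, values in d1.items():                 # .items() yields (SIC label, list of values)
    out = my_function(values)                  # one output value per item in the list
    pieces.append(pd.DataFrame({'SIC': sic, 'result': out}))

pd.concat(pieces, ignore_index=True).to_csv('output.csv', index=False)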

Pandas convert data from two tables into third table. Cross Referencing and converting unique rows to columns

I have the following tables:
Table A
listData = {'id':[1,2,3],'date':['06-05-2021','07-05-2021','17-05-2021']}
pd.DataFrame(listData,columns=['id','date'])
Table B
detailData = {'code':['D123','F268','A291','D123','F268','A291'],'id':['1','1','1','2','2','2'],'stock':[5,5,2,10,11,8]}
pd.DataFrame(detailData,columns=['code','id','stock'])
OUTPUT TABLE
output = {'code':['D123','F268','A291'],'06-05-2021':[5,5,2],'07-05-2021':[10,11,8]}
pd.DataFrame(output,columns=['code','06-05-2021','07-05-2021'])
Note: the code above is hard-coded output. I need to generate the output table from Table A and Table B.
Here is a brief explanation of how the output table is generated, in case it is not self-explanatory.
The id column needs to be cross-referenced from Table A to Table B, so the dates from Table A are substituted in for the ids in Table B.
Then all the unique dates should become columns, with the corresponding stock values moved into the newly created date columns.
I am not sure where to start to do this. I am new to pandas and have only ever used it for simple data manipulation. If anyone can suggest me where to get started, it will be of great help.
Try:
tableA['id'] = tableA['id'].astype(str)
tableB.merge(tableA, on='id').pivot(index='code', columns='date', values='stock')
Output:
date 06-05-2021 07-05-2021
code
A291 2 8
D123 5 10
F268 5 11
Details:
First, merge on id; this is like doing a SQL join. For the merge the dtypes must match, hence the astype to str.
Next, reshape the dataframe using pivot to get code by date.
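For reference, a self-contained version tying the question's construction to this answer (the names tableA and tableB are assumed, since the question did not assign the DataFrames to variables):
import pandas as pd

tableA = pd.DataFrame({'id': [1, 2, 3],
                       'date': ['06-05-2021', '07-05-2021', '17-05-2021']})
tableB = pd.DataFrame({'code': ['D123', 'F268', 'A291', 'D123', 'F268', 'A291'],
                       'id': ['1', '1', '1', '2', '2', '2'],
                       'stock': [5, 5, 2, 10, 11, 8]})

tableA['id'] = tableA['id'].astype(str)   # match dtypes before the join
out = tableB.merge(tableA, on='id').pivot(index='code', columns='date', values='stock')
print(out)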

Detecting bad information (python/pandas)

I am new to Python and pandas and I was wondering whether pandas can filter out information within a dataframe that is inconsistent. For example, imagine a dataframe with 2 columns: (1) product code and (2) unit of measurement. The same product code in column 1 may repeat several times, and there would be several different product codes. I would like to filter out the product codes for which there is more than one unit of measurement for the same product code. Ideally, when this happens the filter would bring back all instances of such a product code, not just the instance in which the unit of measurement differs. To put more colour to my request, the real objective here is to identify the product codes which have inconsistent units of measurement, as the same product code should always have the same unit of measurement in all instances.
Thanks in advance!!
First you want some mapping of product code -> unit of measurement, i.e. the ground truth. You can either supply this yourself, or try to be clever and derive it from the data, assuming that the most frequently used unit of measurement for each product code is the correct one. You could get this by doing:
truth_mapping = df.groupby(['product_code'])['unit_of_measurement'].agg(lambda x:x.value_counts().index[0]).to_dict()
Then you can get a column that is the 'correct' unit of measurement
df['correct_unit'] = df['product_code'].apply(truth_mapping.get)
Then you can filter to rows that do not have the correct mapping:
df[df['correct_unit'] != df['unit_of_measurement']]
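And if you want every instance of the offending product codes (as the question asks), one possible follow-up, sketched with the same column names, is to filter on those codes:
# product codes that have at least one row deviating from the 'correct' unit
bad_codes = df.loc[df['correct_unit'] != df['unit_of_measurement'], 'product_code'].unique()

# bring back every row for those product codes, not just the deviating ones
df[df['product_code'].isin(bad_codes)]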
Try this:
Sample df:
df12 = pd.DataFrame({'Product Code': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'D', 'E'],
                     'Unit of Measurement': ['x', 'x', 'y', 'z', 'w', 'w', 'q', 'r', 'a', 'c']})
Group by and get the count of each (Product Code, Unit of Measurement) pair:
new = df12.groupby(['Product Code','Unit of Measurement']).size().reset_index().rename(columns={0:'count'})
Then drop all rows where the Product Code is repeated:
new.drop_duplicates(subset=['Product Code'], keep=False)
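If instead you want to keep the product codes that have more than one unit of measurement and pull back all of their original rows, a possible sketch using nunique:
# True for rows whose Product Code has more than one distinct unit of measurement
inconsistent = df12.groupby('Product Code')['Unit of Measurement'].transform('nunique') > 1

# all original rows for those product codes
df12[inconsistent]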
