My first data frame:
import pandas as pd

product = pd.DataFrame({
    'Product_ID': [101, 102, 103, 104, 105, 106, 107, 101],
    'Product_name': ['Watch', 'Bag', 'Shoes', 'Smartphone', 'Books', 'Oil', 'Laptop', 'New Watch'],
    'Category': ['Fashion', 'Fashion', 'Fashion', 'Electronics', 'Study', 'Grocery', 'Electronics', 'Electronics'],
    'Price': [299.0, 1350.50, 2999.0, 14999.0, 145.0, 110.0, 79999.0, 9898.0],
    'Seller_City': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata', 'Delhi', 'Chennai', 'Bengalore', 'New York']
})
My second data frame holds the transactions:
customer = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'name': ['Olivia', 'Aditya', 'Cory', 'Isabell', 'Dominic', 'Tyler', 'Samuel', 'Daniel', 'Jeremy'],
    'age': [20, 25, 15, 10, 30, 65, 35, 18, 23],
    'Product_ID': [101, 0, 106, 0, 103, 104, 0, 0, 107],
    'Purchased_Product': ['Watch', 'NA', 'Oil', 'NA', 'Shoes', 'Smartphone', 'NA', 'NA', 'Laptop'],
    'City': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Chennai', 'Delhi', 'Kolkata', 'Delhi', 'Mumbai']
})
I want the Price from the first data frame to come into the merged data frame, the common key being 'Product_ID'. Note that against Product_ID 101 there are two prices, 299.00 and 9898.00. I want the latter one, 9898.0, to come into the merged data set, since it is the latest price.
Currently my code is not giving the right answer; it returns both prices:
customerpur = pd.merge(customer, product[['Price', 'Product_ID']], on='Product_ID', how='left')
customerpur
id name age Product_ID Purchased_Product City Price
0 1 Olivia 20 101 Watch Mumbai 299.0
1 1 Olivia 20 101 Watch Mumbai 9898.0
There is no explicit timestamp, so I assume the row order of the data frame reflects recency. You can drop duplicates at the end:
customerpur.drop_duplicates(subset=['id'], keep='last')
result:
id name age Product_ID Purchased_Product City Price
1 1 Olivia 20 101 Watch Mumbai 9898.0
2 2 Aditya 25 0 NA Delhi NaN
3 3 Cory 15 106 Oil Bangalore 110.0
4 4 Isabell 10 0 NA Chennai NaN
5 5 Dominic 30 103 Shoes Chennai 2999.0
6 6 Tyler 65 104 Smartphone Delhi 14999.0
7 7 Samuel 35 0 NA Kolkata NaN
8 8 Daniel 18 0 NA Delhi NaN
9 9 Jeremy 23 107 Laptop Mumbai 79999.0
Note the keep='last' argument: we keep only the last price registered for each id.
If you care about performance or the dataset is huge, deduplication should instead be done before merging:
product = product.drop_duplicates(subset=['Product_ID'], keep='last')
In your data frame there is no indicator of the latest entry, so you might need to first remove the earlier entry for Product_ID 101 from the product data frame, as follows:
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
This keeps the last entry for each Product_ID, and you can then do the merge as:
pd.merge(result_product, customer, on='Product_ID')
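For completeness, a short end-to-end sketch combining the steps above (assuming, as noted, that row order reflects recency): deduplicate product first, then left-merge onto customer so that customers without a matching product are kept. This is an illustration, not the original poster's code.
import pandas as pd

# Keep only the most recent price per Product_ID (last occurrence wins)
product_latest = product.drop_duplicates(subset=['Product_ID'], keep='last')

# Left merge so customers whose Product_ID has no match (e.g. 0) are retained with NaN Price
customerpur = pd.merge(
    customer,
    product_latest[['Product_ID', 'Price']],
    on='Product_ID',
    how='left',
)
print(customerpur)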
I have a data frame which consists of the following data:
cus_id sex city state product_type var1 var2 type score
CA-1 Male ABC New York type-1 10 10 P-1 750
CA-2 Female ABC Alaska type-2 10 9.5 P-2 850
CA-3 Male Denver dfdfd type-3 10 11.1 P-3 560
CA-4 Female esx Nsdfe type-4 15 15 P-3 734
CA-5 Male dfr dfdedc type-5 13 12.9 P-3 798
CA-6 Male xds Nsdfe type-6 14.5 10.8 P-2 700
CA-7 Female edf New York type-5 14.2 14 P-2 550
CA-8 Female xde New York type-5 04 04 P-1 650
CA-9 Male wer New York type-1 10 11 P-1 730
Using the above data frame, I want to create segments based on the variables sex, city, state and score, for the independent parameters below:
product_type: static, from type-1 to type-7
type: static, from P-1 to P-3
score: ranges from 100 to 1000 and can be binned according to the segment identified for product_type and type
I want to identify the cluster where the percentage difference between var2 and var1 is minimal. For example, for cus_id CA-1 the match is 100%, so we will have a segment for 100% with the matching sex, city, state and score variables.
I don't know how to build the clusters using K-means; I need an approach and suggestions.
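A rough sketch of one possible K-means setup with scikit-learn (not an authoritative solution): the column names follow the table above, while the input file name and the number of clusters are assumptions for illustration.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv('customers.csv')  # hypothetical input file with the columns shown above

# Percentage gap between var1 and var2, as described in the question
df['var_gap_pct'] = (df['var2'] - df['var1']).abs() / df['var1'] * 100

# One-hot encode the categorical variables, scale the numeric ones
features = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'),
     ['sex', 'city', 'state', 'product_type', 'type']),
    ('num', StandardScaler(), ['score', 'var_gap_pct']),
])

# n_clusters=4 is an arbitrary choice; it would need tuning (e.g. elbow method)
model = make_pipeline(features, KMeans(n_clusters=4, random_state=0))
df['segment'] = model.fit_predict(df)
print(df[['cus_id', 'var_gap_pct', 'segment']])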
I am working with an 80 GB data set in Python. The data has 30 columns and ~180,000,000 rows.
I am using the chunksize parameter of pd.read_csv to read the data in chunks, and I then iterate through the data to build a dictionary of the counties with their associated frequencies.
This is where I am stuck. Once I have the list of counties, I want to iterate through the chunks row by row again, summing the values of 2-3 other columns associated with each county, and place the results into a new DataFrame. That would be roughly 4 columns and 3,000 rows, which is far more manageable for my computer.
I really don't know how to do this; this is my first time working with a large data set in Python.
import pandas as pd
from collections import defaultdict

df_chunk = pd.read_csv('file.tsv', sep='\t', chunksize=8000000)

county_dict = defaultdict(int)
for chunk in df_chunk:
    for county in chunk['COUNTY']:
        county_dict[county] += 1

for chunk in df_chunk:
    for row in chunk:
        # I don't know where to go from here
        pass
I expect to be able to make a DataFrame with a column of all the counties, a column for the total sales of product "1" per county, another column for the sales of another product per county, and further columns of the same kind as needed.
The idea
I was not sure whether you have data for different counties (e.g. in the UK or USA) or countries (of the world), so I decided to use data concerning countries.
The idea is to:
1. Group data from each chunk by country.
2. Generate a partial result for this chunk, as a DataFrame with:
   - the sums of each column of interest (per country),
   - the number of rows per country.
   To allow concatenation of the partial results (in a moment), each partial result should contain the chunk number as an additional index level.
3. Concatenate the partial results vertically (thanks to the additional index level, each row has a different index).
4. Compute the final result (total sums and row counts) as the sum of the above result, grouped by country (discarding the chunk number).
Test data
The source CSV file contains country names and two columns to sum (tab-separated):
Country Amount_1 Amount_2
Austria 41 46
Belgium 30 50
Austria 45 44
Denmark 31 42
Finland 42 32
Austria 10 12
France 74 54
Germany 81 65
France 40 20
Italy 54 42
France 51 16
Norway 14 33
Italy 12 33
France 21 30
For testing purposes, I assumed a chunk size of just 5 rows:
chunksize = 5
Solution
The preparatory steps and the main processing loop are as follows:
import pandas as pd

df_chunk = pd.read_csv('Input.csv', sep='\t', chunksize=chunksize)
chunkPartRes = []  # Partial results from each chunk
chunkNo = 0
for chunk in df_chunk:
    chunkNo += 1
    gr = chunk.groupby('Country')
    # Sum the desired columns and count the rows of each group
    res = gr.agg(Amount_1=('Amount_1', 'sum'), Amount_2=('Amount_2', 'sum'))\
        .join(gr.size().rename('Count'))
    # Add a top index level (chunk number), then append
    chunkPartRes.append(pd.concat([res], keys=[chunkNo], names=['ChunkNo']))
To concatenate the above partial results into a single DataFrame,
but still with separate results from each chunk, run:
chunkRes = pd.concat(chunkPartRes)
For my test data, the result is:
Amount_1 Amount_2 Count
ChunkNo Country
1 Austria 86 90 2
Belgium 30 50 1
Denmark 31 42 1
Finland 42 32 1
2 Austria 10 12 1
France 114 74 2
Germany 81 65 1
Italy 54 42 1
3 France 72 46 2
Italy 12 33 1
Norway 14 33 1
And to generate the final result, summing data from all chunks,
but keeping separation by countries, run:
res = chunkRes.groupby(level=1).sum()
The result is:
Amount_1 Amount_2 Count
Country
Austria 96 102 3
Belgium 30 50 1
Denmark 31 42 1
Finland 42 32 1
France 186 120 4
Germany 81 65 1
Italy 66 75 2
Norway 14 33 1
To sum up
Even looking only at how the numbers of rows per country are computed, this solution is more "pandasonic" and elegant than using a defaultdict and incrementing counts in a loop over each row.
Grouping and counting rows per group is also significantly quicker than a loop operating on individual rows.
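Mapped back to the original question, the same pattern would look roughly like the sketch below; 'Sales_1' and 'Sales_2' are hypothetical names standing in for the 2-3 sales columns to be summed, since the question does not show them.
import pandas as pd

partial_results = []
reader = pd.read_csv('file.tsv', sep='\t', chunksize=8000000)
for chunk_no, chunk in enumerate(reader, start=1):
    gr = chunk.groupby('COUNTY')
    # Per-county sums of the (hypothetical) sales columns plus the row count
    part = gr.agg(Sales_1=('Sales_1', 'sum'), Sales_2=('Sales_2', 'sum'))\
        .join(gr.size().rename('Count'))
    # Add the chunk number as a top index level, as in the country example
    partial_results.append(pd.concat([part], keys=[chunk_no], names=['ChunkNo']))

# Total sums and row counts per county, discarding the chunk number
result = pd.concat(partial_results).groupby(level='COUNTY').sum()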
This question already has an answer here: Convert pandas.groupby to dict (closed as a duplicate 4 years ago).
Given a table (DataFrame) x:
name day earnings revenue
Oliver 1 100 44
Oliver 2 200 69
John 1 144 11
John 2 415 54
John 3 33 10
John 4 82 82
Is it possible to split the table into two tables based on the name column (which acts as an index), and nest the two tables under the same object (I'm not sure about the exact terms to use)? So in the example above, tables[0] would be:
name day earnings revenue
Oliver 1 100 44
Oliver 2 200 69
and tables[1] would be:
name day earnings revenue
John 1 144 11
John 2 415 54
John 3 33 10
John 4 82 82
Note that the number of rows in each 'sub-table' may vary.
Cheers,
Create a dictionary of DataFrames:
dfs = dict(tuple(df.groupby('name')))
Then select by key, i.e. by the value of the name column:
print(dfs['Oliver'])
print(dfs['John'])
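If you specifically want positional access like tables[0] and tables[1] from the question, a list comprehension over the same groupby works as well (a small variation on the dictionary approach above):
# sort=False keeps the order of first appearance, so tables[0] is Oliver and tables[1] is John
tables = [group for _, group in df.groupby('name', sort=False)]
print(tables[0])
print(tables[1])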
I'm working on a dataset called gradedata.csv in Python pandas, where I've created a new binned column called 'status': 'Pass' if grade > 70 and 'Fail' if grade <= 70. Here is a listing of the first five rows of the dataset:
fname lname gender age exercise hours grade \
0 Marcia Pugh female 17 3 10 82.4
1 Kadeem Morrison male 18 4 4 78.2
2 Nash Powell male 18 5 9 79.3
3 Noelani Wagner female 14 2 7 83.2
4 Noelani Cherry female 18 4 15 87.4
address status
0 9253 Richardson Road, Matawan, NJ 07747 Pass
1 33 Spring Dr., Taunton, MA 02780 Pass
2 41 Hill Avenue, Mentor, OH 44060 Pass
3 8839 Marshall St., Miami, FL 33125 Pass
4 8304 Charles Rd., Lewis Center, OH 43035 Pass
Now, how do I compute the mean hours of exercise of female students with a 'status' of 'Pass'?
I've used the code below, but it isn't working.
print(df.groupby('gender', 'status')['exercise'].mean())
I'm new to pandas. Could anyone please help me solve this?
You are very close. Note that your groupby key must be one of: a mapping, a function, a label, or a list of labels. In this case, you want a list of labels. For example:
res = df.groupby(['gender', 'status'])['exercise'].mean()
You can then extract your desired result via pd.Series.get:
query = res.get(('female', 'Pass'))
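If only that single group is needed, an equivalent one-off boolean filter (just an alternative sketch) gives the same number:
# Mean exercise hours of female students who passed, via boolean indexing
mean_female_pass = df.loc[(df['gender'] == 'female') & (df['status'] == 'Pass'), 'exercise'].mean()
print(mean_female_pass)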