In a market's data set, each record belongs to a product group. I want to group this data by ficheno (receipt number) and find the total number of male and female customers.
The data set is as in the picture. The number of unique ficheno values is 141783, so the total number of customers should be 141783.
If you just want to count the totals of the gender column, you may try this:
import pandas as pd

df = pd.read_csv('./data.csv')
# keep one row per receipt so each customer is counted once
df = df.groupby('ficheno').first()
male_count = df[df.gender == 'm'].gender.count()
female_count = df[df.gender == 'f'].gender.count()
This will result in male = 2, female = 3 according to your sample dataset.
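The same count can also be done in one step with value_counts after dropping duplicate receipts. A minimal sketch with made-up rows, since the real CSV isn't shown:

```python
import pandas as pd

# Hypothetical stand-in for the real data: two rows share ficheno 101
df = pd.DataFrame({
    'ficheno': [101, 101, 102, 103, 104, 105],
    'gender':  ['m', 'm', 'f', 'f', 'm', 'f'],
})

# One row per receipt, then count both genders at once
counts = df.drop_duplicates('ficheno')['gender'].value_counts()
print(counts['m'], counts['f'])  # 2 3
```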
I have a sample of companies with financial figures which I would like to compare. My data looks like this:
Cusip9 Issuer IPO Year Total Assets Long-Term Debt Sales SIC-Code
1 783755101 Ryerson Tull Inc 1996 9322000.0 2632000.0 633000.0 3661
2 826170102 Siebel Sys Inc 1996 995010.0 0.0 50250.0 2456
3 894363100 Travis Boats & Motors Inc 1996 313500.0 43340.0 23830.0 3661
4 159186105 Channell Commercial Corp 1996 426580.0 3380.0 111100.0 7483
5 742580103 Printware Inc 1996 145750.0 0.0 23830.0 8473
For every company I want to calculate a "similarity score" that indicates its comparability with other companies, based on several financial figures. The comparability should be expressed as the Euclidean distance, the square root of the sum of the squared differences between the financial figures, to the "closest" company. So I need to calculate the distance to every company that fits the conditions, but only keep the smallest score: Assets of Company 1 minus Assets of Company 2, plus Debt of Company 1 minus Debt of Company 2, and so on.
√((x_1-y_1 )^2+(x_2-y_2 )^2)
This should only be computed for companies with the same SIC-Code, and the IPO Year of the comparable companies should be smaller than that of the company for which the similarity score is computed; I only want to compare against already-listed companies.
Hopefully my point is clear. Does anyone have an idea where I can start? I am just starting with programming and am completely lost with this.
Thanks in advance.
I would first create different dataframes according to the SIC-Code, so every new dataframe only contains companies with the same SIC-Code. Then, for each of those dataframes, double-loop over the companies, compute the scores, and store them in a matrix. (You'll end up with a symmetric matrix of scores.)
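That plan could look like the following sketch (made-up rows using the question's column names; only two of the financial figures are used for brevity):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: three companies sharing one SIC code
df = pd.DataFrame({
    'Issuer': ['A', 'B', 'C'],
    'SIC-Code': [3661, 3661, 3661],
    'Total Assets': [9322000.0, 313500.0, 145750.0],
    'Long-Term Debt': [2632000.0, 43340.0, 0.0],
})

scores = {}
for sic, group in df.groupby('SIC-Code'):
    vals = group[['Total Assets', 'Long-Term Debt']].to_numpy()
    n = len(vals)
    mat = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            # Euclidean distance between company a and company b
            mat[a, b] = np.sqrt(((vals[a] - vals[b]) ** 2).sum())
    scores[sic] = mat
```

Each matrix is symmetric with zeros on the diagonal; the closest peer of company a is the smallest off-diagonal entry in row a.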
Try this. Here I compare each company against companies whose IPO Year is equal to or smaller (since you didn't give any company record with a smaller IPO year); you can change it to strictly smaller (<) in the statement Group = df[...].
def closestCompany(companyRecord):
    Group = df[(df['SIC-Code'] == companyRecord['SIC-Code'])
               & (df['IPO Year'] <= companyRecord['IPO Year'])
               & (df['Issuer'] != companyRecord['Issuer'])]
    return (((Group['Total Assets'] - companyRecord['Total Assets'])**2
             + (Group['Long-Term Debt'] - companyRecord['Long-Term Debt'])**2)**0.5).min()

df['Closest Company Similarity Score'] = df.apply(closestCompany, axis=1)
df
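A quick self-contained check of this apply-based approach, using three rows copied from the sample. Note that Siebel has no same-SIC peer, so its score comes out NaN because min() of an empty selection is NaN:

```python
import pandas as pd

df = pd.DataFrame({
    'Issuer': ['Ryerson Tull Inc', 'Siebel Sys Inc', 'Travis Boats & Motors Inc'],
    'IPO Year': [1996, 1996, 1996],
    'Total Assets': [9322000.0, 995010.0, 313500.0],
    'Long-Term Debt': [2632000.0, 0.0, 43340.0],
    'SIC-Code': [3661, 2456, 3661],
})

def closestCompany(companyRecord):
    # Same SIC code, IPO year not later, and not the company itself
    Group = df[(df['SIC-Code'] == companyRecord['SIC-Code'])
               & (df['IPO Year'] <= companyRecord['IPO Year'])
               & (df['Issuer'] != companyRecord['Issuer'])]
    return (((Group['Total Assets'] - companyRecord['Total Assets'])**2
             + (Group['Long-Term Debt'] - companyRecord['Long-Term Debt'])**2)**0.5).min()

df['Closest Company Similarity Score'] = df.apply(closestCompany, axis=1)
```

Ryerson and Travis are each other's only peer here, so their scores are equal (the distance is symmetric).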
I have two different CSVs with different information that I need. The first has an account number, a ticker (mutual funds), and a dollar amount. The second has a list of tickers and their classification (stock, bond, etc.). I want to merge the two on the ticker so that I have the account number, ticker, classification, and dollar amount all together. Several of the account numbers hold the same funds, meaning the same ticker is used multiple times. When I try merging, I get duplicated rows and a lot of missing information.
I tried merging with inner and on the left. I tried making the second CSV a dictionary to reference. I attempted a for loop with a lambda, but I'm pretty new to this so that didn't go well. I also tried to groupby account number and ticker before merging, but that didn't work either. The columns I'm trying to merge on have the same datatype; I tried with non-float object and string.
pd.merge(df1, df2, on = 'Ticker', how = 'inner')
expected: (each account number may have 5 unique tickers)
A B C D
1 a bond 500
1 b stock 100
1 c bond 250
2 a bond 300
2 b stock 400
what I get:
A B C D
1 a bond 500
1 a bond 500
1 a bond 500
2 a bond 300
2 a bond 300
It seems to overwrite all the unique rows for the account number with the first row.
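Duplicated rows after a merge usually mean the Ticker column is not unique in the classification file, so every match gets multiplied. A minimal sketch (with made-up frames standing in for the two CSVs) that de-duplicates the lookup table before merging:

```python
import pandas as pd

# Stand-ins for the two CSVs
df1 = pd.DataFrame({
    'Account': [1, 1, 1, 2, 2],
    'Ticker':  ['a', 'b', 'c', 'a', 'b'],
    'Amount':  [500, 100, 250, 300, 400],
})
df2 = pd.DataFrame({
    'Ticker': ['a', 'a', 'b', 'c'],   # 'a' appears twice -> duplicate rows on merge
    'Class':  ['bond', 'bond', 'stock', 'bond'],
})

# One classification per ticker, then a left merge keeps every holding exactly once
merged = pd.merge(df1, df2.drop_duplicates('Ticker'), on='Ticker', how='left')
```

With the duplicate 'a' removed from the lookup table, the result has one row per holding, each carrying its classification.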
I am making a dummy dataset with a list of companies as user_id, the jobs posted by each company as job_id, and c_id as the candidate id.
I have already achieved the first two steps, and my dataset looks like the below.
user_id job_id
0 HP HP2
1 Microsoft Microsoft4
2 Accenture Accenture2
3 HP HP0
4 Dell Dell4
5 FIS FIS1
6 HP HP0
7 Microsoft Microsoft4
8 Dell Dell2
9 Accenture Accenture0
They are also shuffled. Now I wish to add a random candidate id to this dataset in such a way that no c_id is repeated for a particular job_id.
My approach is as follows (joblist is a list of all job_ids):
for i in range(50):
    l = list(range(0, len(df[df['job_id'] == joblist[i]])))
    random.shuffle(l)
    df['c_id'][df['job_id'] == joblist[i]] = l
after which I tested it as:
len(df['c_id'][df['job_id'] == joblist[0]])
output = 168
df['c_id'][df['job_id'] == joblist[0]].nunique()
output = 101
The same thing happens with all values. I have rechecked the uniqueness of l after each step, and it holds 168 unique values.
What am I doing wrong here?
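One likely culprit (an assumption, since the full df isn't shown) is the chained assignment `df['c_id'][mask] = l`: pandas may write through an intermediate copy and warn with SettingWithCopyWarning, so some assignments can silently fail to land. Writing through a single `.loc` call is the supported idiom:

```python
import random
import pandas as pd

# Small stand-in frame: 6 postings for one job_id
df = pd.DataFrame({'job_id': ['HP0'] * 6, 'c_id': [None] * 6})

mask = df['job_id'] == 'HP0'
l = list(range(mask.sum()))
random.shuffle(l)

# Single .loc assignment: the masked rows receive l in order,
# so every c_id under this job_id is unique
df.loc[mask, 'c_id'] = l
```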
Unique IDs can be generated with basic pandas functions, so you don't need anything fancy. The solutions vary in efficiency based on how big your df is.
# Hashing for small datasets:
df['new_id'] = pd.factorize(df.apply(tuple, axis=1))[0] + 1
# Grouping for larger datasets:
df['new_id'] = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1
# Assign:
# Assign, using some combination of columns:
df.assign(id=(...).astype('category').cat.codes)
Further reading:
Q: [Pandas] How to efficiently assign unique ID to individuals with multiple entries based on name in very large df
How to assign a unique ID to detect repeated rows in a pandas dataframe?
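A quick check of the first two approaches on a toy frame (made-up data): both label rows by first appearance, so repeated rows get the same ID.

```python
import pandas as pd

df = pd.DataFrame({'name': ['ann', 'bob', 'ann', 'cat'],
                   'city': ['NY', 'LA', 'NY', 'SF']})

# Hash each row to a tuple, then factorize in order of appearance
id_hash = pd.factorize(df.apply(tuple, axis=1))[0] + 1

# Group on all columns; ngroup numbers the groups in order of appearance
id_group = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1

print(list(id_hash))   # [1, 2, 1, 3]
print(list(id_group))  # [1, 2, 1, 3]
```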
Here is a sample df:
data = {"Brand":{"0":"BrandA","1":"BrandA","2":"BrandB","3":"BrandB","4":"BrandC","5":"BrandC"},"Cost":{"0":18.5,"1":19.5,"2":6,"3":6,"4":17.69,"5":18.19},"IN STOCK":{"0":10,"1":15,"2":5,"3":1,"4":12,"5":12},"Inventory Number":{"0":1,"1":1,"2":2,"3":2,"4":3,"5":3},"Labels":{"0":"Black","1":"Black","2":"White","3":"White","4":"Blue","5":"Blue"},"Maximum Price":{"0":30.0,"1":35.0,"2":50,"3":45.12,"4":76.78,"5":76.78},"Minimum Price":{"0":23.96,"1":25.96,"2":12.12,"3":17.54,"4":33.12,"5":28.29},"Product Name":{"0":"Product A","1":"Product A","2":"ProductB","3":"ProductB","4":"ProductC","5":"ProductC"}}
df = pd.DataFrame(data=data)
My actual data set is much larger but maintains the same pattern: every Inventory Number is shared by exactly 2 rows.
My goal is to create a new data frame that contains only the inventory numbers where a cell value differs between the two rows, and, for those inventory numbers, only the data from the lower-index row that differs from the other row.
For this example the resulting data frame would need to look like:
data = {"Inventory Number":{"0":1,"1":2,"2":3},"Cost":{"0":18.50,"1":"","2":17.69},"IN STOCK":{"0":10,"1":5,"2":""},"Maximum Price":{"0":30,"1":50,"2":""},"Minimum Price":{"0":23.96,"1":12.12,"2":33.12}}
df = pd.DataFrame(data=data)
The next time this would run, perhaps nothing changed in the "Maximum Price", so that column would need to not be included at all.
I was hoping someone would have a clean solution using groupby, but if not, I imagine the solution would involve dropping all duplicates, then looping through the remaining inventory numbers and evaluating each column for duplicates.
icol = 'Inventory Number'

# drop rows that are identical across every column
d0 = df.drop_duplicates(keep=False)

# number the two rows of each inventory pair 0 and 1
i = d0.groupby(icol).cumcount()

# pivot so the two row versions of each field sit side by side
d1 = d0.set_index([icol, i]).unstack(icol).T

# keep only the fields where the two rows disagree
# (use d1[0] instead of d1[1] to take the lower-index row's values)
d1[1][d1[1] != d1[0]].unstack(0)
Cost IN STOCK Maximum Price Minimum Price
Inventory Number
1 19.5 15 35 25.96
2 None 1 45.12 17.54
3 18.19 None None 28.29
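For a self-contained check, here is the same pipeline run end to end on the sample data from the question:

```python
import pandas as pd

data = {"Brand": ["BrandA", "BrandA", "BrandB", "BrandB", "BrandC", "BrandC"],
        "Cost": [18.5, 19.5, 6, 6, 17.69, 18.19],
        "IN STOCK": [10, 15, 5, 1, 12, 12],
        "Inventory Number": [1, 1, 2, 2, 3, 3],
        "Labels": ["Black", "Black", "White", "White", "Blue", "Blue"],
        "Maximum Price": [30.0, 35.0, 50, 45.12, 76.78, 76.78],
        "Minimum Price": [23.96, 25.96, 12.12, 17.54, 33.12, 28.29],
        "Product Name": ["Product A", "Product A", "ProductB", "ProductB",
                         "ProductC", "ProductC"]}
df = pd.DataFrame(data)

icol = 'Inventory Number'
d0 = df.drop_duplicates(keep=False)          # drop fully duplicated rows
i = d0.groupby(icol).cumcount()              # 0/1 position within each pair
d1 = d0.set_index([icol, i]).unstack(icol).T # two row versions side by side
out = d1[1][d1[1] != d1[0]].unstack(0)       # keep fields where the pair differs
```

Fields that match within every pair (Brand, Labels, Product Name) disappear entirely, and matching cells within a differing field come back as NaN, as in the output above.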
Try this:
In [68]: cols = ['Cost','IN STOCK','Inventory Number','Maximum Price','Minimum Price']
In [69]: df[cols].drop_duplicates(subset=['Inventory Number'])
Out[69]:
    Cost  IN STOCK  Inventory Number  Maximum Price  Minimum Price
0  18.50        10                 1          30.00          23.96
2   6.00         5                 2          50.00          12.12
4  17.69        12                 3          76.78          33.12