I have a csv file structured like this:
As you can see, many rows are repeated (they represent the same entity), with the attribute 'category' being the only difference between them. I would like to merge those rows and collect all the categories into a single value.
For example the attribute 'category' for Walmart should be: "Retail, Dowjones, SuperMarketChains".
Edit:
I would like the output table to be structured like this:
Edit 2:
What worked for me was:
df4.groupby(["ID azienda", "Name", "Company code", "Marketcap", "Share price",
             "Earnings", "Revenue", "Shares", "Employees"])['Category'].agg(list).reset_index()
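For reference, a minimal runnable sketch of that approach on made-up data (the column names and values below are hypothetical, since the original CSV isn't shown):
import pandas as pd

df4 = pd.DataFrame({
    "Name": ["Walmart", "Walmart", "Walmart", "Amazon"],
    "Marketcap": [400, 400, 400, 1700],
    "Category": ["Retail", "Dowjones", "SuperMarketChains", "Ecommerce"],
})

# Group on every column except 'Category' and collect the categories per entity;
# use .agg(", ".join) instead of .agg(list) for a single comma-separated string
out = df4.groupby(["Name", "Marketcap"])["Category"].agg(list).reset_index()
print(out)
#       Name  Marketcap                               Category
# 0   Amazon       1700                            [Ecommerce]
# 1  Walmart        400  [Retail, Dowjones, SuperMarketChains]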
Quick and Dirty
# Map each Name to its comma-joined categories, then de-duplicate
df2 = df.groupby("Name")['Category'].apply(','.join)
subst = dict(df2)
df['category'] = df['Name'].replace(subst)
df = df.drop_duplicates('Name')
If you prefer the multiple categories to be stored as a list in the category column, change the first line to:
df2 = df.groupby("Name")['Category'].apply(list)
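A self-contained sketch of this quick-and-dirty approach on made-up data (the names and values are assumed, since the original file isn't shown):
import pandas as pd

df = pd.DataFrame({
    "Name": ["Walmart", "Walmart", "Walmart", "Amazon"],
    "Category": ["Retail", "Dowjones", "SuperMarketChains", "Ecommerce"],
})

df2 = df.groupby("Name")['Category'].apply(','.join)  # Name -> "Retail,Dowjones,..."
df['category'] = df['Name'].replace(dict(df2))        # write the joined string back onto every row
df = df.drop_duplicates('Name')                       # keep one row per entity
print(df[['Name', 'category']])
#       Name                           category
# 0  Walmart  Retail,Dowjones,SuperMarketChains
# 3   Amazon                          Ecommerce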
Not sure if you want a new table or just a list of the categories. Below is how you could make a table with the hashes, if those are important.
import pandas as pd
df = pd.DataFrame({
    'Name': ['W', 'W', 'W', 'A', 'A', 'A'],
    'Category': ['Retail', 'Dow', 'Chain', 'Ecom', 'Internet', 'Dow'],
    'Hash': [1, 2, 3, 4, 5, 6],
})
# print(df)
# Name Category Hash
# 0 W Retail 1
# 1 W Dow 2
# 2 W Chain 3
# 3 A Ecom 4
# 4 A Internet 5
# 5 A Dow 6
#Make a new df which has one row per company and one column per category, values are hashes
piv_df = df.pivot(
    index='Name',
    columns='Category',
    values='Hash',
)
# print(piv_df)
# Category Chain Dow Ecom Internet Retail
# Name
# A NaN 6.0 4.0 5.0 NaN
# W 3.0 2.0 NaN NaN 1.0
My dataframe looks like below:
import numpy as np
import pandas as pd

data = {'pred_id': [np.nan, np.nan, 'Pred ID', 258, 265, 595, 658],
        'class': [np.nan, np.nan, np.nan, 'pork', 'sausage', 'chicken', 'pork'],
        'image': ['Weight', 115.37, 'pred',
                  'app_images/03112020/Prediction/222_prediction_resized.jpg',
                  'app_images/03112020/Prediction/333_prediction_resized.jpg',
                  'volume', np.nan]}
df = pd.DataFrame(data)
df
Edited:
I am trying to create a new column 'image_name' with values from the column 'image'. I want to extract a substring from the 'image' values that contain 'app_images/', and otherwise keep the value as it is.
I tried the code below and it's throwing an 'AttributeError'.
Please help me figure out how to check the dtype and then extract the substring from values that contain 'app_images/', keeping the other values as they are. I don't know how to fix this. Thanks in advance.
images = []
for i in df['image']:
    if i.dtypes == object:  # AttributeError: individual elements are scalars and have no .dtypes
        if i.__contains__('app_images/'):
            new = i.split('_')[1]
            name = new.split('/')[3] + '.jpg'
            images.append(name)
        else:
            images.append(i)
df['image_name'] = images
df
Do not use a loop; use vectorized code: str.extract with a regex.
From your description and code, this seems to be what you expect:
df['image_name'] = (df['image'].str.extract(r'app_images/.*/(\d+)_[^/]+\.jpg',
                                            expand=False) + '.jpg'
                    )
output:
pred_id class image image_name
0 NaN NaN Weight NaN
1 NaN NaN 115.37 NaN
2 Pred ID NaN pred NaN
3 258 pork app_images/03112020/Prediction/222_prediction_resized.jpg 222.jpg
4 265 sausage app_images/03112020/Prediction/333_prediction_resized.jpg 333.jpg
5 595 chicken volume NaN
6 658 pork NaN NaN
regex demo
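Note that str.extract leaves NaN wherever the pattern does not match; since the question asks to keep the original value in that case, one option (a small addition, not part of the original answer) is to fill those gaps back from 'image':
# Keep the original 'image' value wherever the regex did not match
df['image_name'] = (
    df['image'].str.extract(r'app_images/.*/(\d+)_[^/]+\.jpg', expand=False) + '.jpg'
).fillna(df['image'])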
Most of the other questions regarding updating values in a pandas df focus on appending a new column or simply replacing a cell with a new value. My question is a bit different. Assuming my df already has values in it, and I find a new value, I need to add it to the cell's existing value. For example, if a cell already has 5 and I find the value 10 in my file corresponding to that column/row, the value should now be 15.
But I am having trouble writing this bit of code and even getting values to show up in my dataframe.
I have a dictionary, for example:
id_dict={'Treponema': ['162'], 'Leptospira': ['174'], 'Azospirillum': ['192'], 'Campylobacter': ['195', '197', '199', '201'], 'Pseudomonas': ['287'], 'NONE': ['2829358', '2806529']}
And I have sample id files that contain ids and the number of times those ids showed up in a previous file, where the first value is the count and the second value is the id.
cat Sample1_idsummary.txt
1,162
15,174
4,195
5,197
6,201
10,2829358
Some of the ids have the same key in id_dict and I need to create a dataframe like the following:
Sample Treponema Leptospira Azospirillum Campylobacter Pseudomonas NONE
0 sample1 1 15 0 15 0 10
Here is my script, but my issue is that my output is always zero for all columns.
import sys
import pandas as pd

samplefile = sys.argv[1]
sample_ID = samplefile.split("_")[0]  # get just the ID name

def get_ids_counts(id_dict, samplefile):
    '''Obtain a table of id counts from the samplefile.'''
    column_names = ["Sample"]
    column_names.extend([x for x in list(id_dict.keys())])
    df = pd.DataFrame(columns=column_names)
    df["Sample"] = [sample_ID]
    with open(samplefile) as sf:  # open the sample taxid count file
        for line in sf:
            id = line.split(",")[1]            # the taxid (multiple can hit the same lineage info)
            idcount = int(line.split(",")[0])  # the count from uniq
            # For all keys in the dict, if that key is in the sample id file use the count from the id file
            # Otherwise all keys not found in the file are "0" in the df
            if id in id_dict:
                df[list(id_dict.keys())[list(id_dict.values().index(id))]] = idcount
    return df.fillna(0)
It's the very last if statement that is confusing me. How do I make idcount add up each time the same key appears, and why do I always get zeros filled in?
The below mentioned method worked! Here is the updated code:
def get_ids_counts(id_dict, samplefile):
    '''Obtain a table of id counts from the samplefile.'''
    df = pd.DataFrame([id_dict]).stack().explode().to_frame('id').droplevel(0).reset_index().astype({'id': int})
    iddf = pd.read_csv(samplefile, sep=",", names=["count", "id"])
    df = df.merge(iddf, how='outer').fillna(0).groupby('index')['count'].sum().to_frame(sample_ID).T
    return df
And the output, which is still not coming up right:
index 0 Azospirillaceae Campylobacteraceae Leptospiraceae NONE Pseudomonadacea Treponemataceae
mini 106.0 0.0 20.0 0.0 0.0 0.0 5.0
UPDATE 2
With the code below and using my proper files I've managed to get the table but cannot for the life of me get the "NONE" column to show up anymore. Any suggestions? My output is essentially every key value with proper counts but "NONE" disappears.
Instead of doing it iteratively that way, you can automate it and use pandas to perform those operations.
Start by creating the dataframe from id_dict:
df = pd.DataFrame([id_dict]).stack().explode().to_frame('id').droplevel(0).reset_index()\
       .astype({'id': int})
index id
0 Treponema 162
1 Leptospira 174
2 Azospirillum 192
3 Campylobacter 195
4 Campylobacter 197
5 Campylobacter 199
6 Campylobacter 201
7 Pseudomonas 287
8 NONE 2829358
9 NONE 2806529
Read the count/id text file into a data frame:
idDF = pd.read_csv('Sample1_idsummary.txt', sep=',' , names=['count', 'id'])
count id
0 1 162
1 15 174
2 4 195
3 5 197
4 6 201
5 10 2829358
Now outer-merge both dataframes, fill NaNs with 0, group by index, call sum, build the result by calling to_frame with the sample name ('Sample1') as the column name, and finally transpose the dataframe:
df.merge(idDF, how='outer').fillna(0).groupby('index')['count'].sum().to_frame('Sample1').T
OUTPUT:
index Azospirillum Campylobacter Leptospira NONE Pseudomonas Treponema
Sample1 0.0 15.0 15.0 10.0 0.0 1.0
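On the follow-up in UPDATE 2 about the 'NONE' column disappearing: one way to guarantee a column for every key of id_dict, regardless of what survives the merge and groupby, is to reindex the result (a sketch on top of the code above, not part of the original answer):
# Force a column for every key in id_dict, filling missing ones with 0
result = (df.merge(idDF, how='outer').fillna(0)
            .groupby('index')['count'].sum()
            .to_frame('Sample1').T
            .reindex(columns=list(id_dict.keys()), fill_value=0))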
I have a Pandas DataFrame stations with index as id:
id station lat lng
1 Boston 45.343 -45.333
2 New York 56.444 -35.690
I have another DataFrame df1 that has the following:
duration date station gender
NaN 20181118 NaN M
9 20181009 2.0 F
8 20170605 1.0 F
I want to add to df1 so that it looks like the following DataFrame:
duration date station gender lat lng
NaN 20181118 NaN M nan nan
9 20181009 New York F 56.444 -35.690
8 20170605 Boston F 45.343 -45.333
I tried doing this iteratively by referring to stations.iloc[] as shown in the following example, but I have about 2 million rows and it ended up taking a lot of time.
stat_list = []
lng_list = []
lat_list = []
for stat in df1['station']:
    if not np.isnan(stat):
        ref = stations.iloc[stat]
        stat_list.append(ref.station)
        lng_list.append(ref.lng)
        lat_list.append(ref.lat)
    else:
        stat_list.append(np.nan)
        lng_list.append(np.nan)
        lat_list.append(np.nan)
Is there a faster way to do this?
Looks like this would be best solved with a merge which should significantly boost performance:
df1.merge(stations, left_on="station", right_index=True, how="left")
This will leave you with two columns, station_x and station_y. If you only want the station column with the string names in it, you can do:
df_merged = df1.merge(stations, left_on="station", right_index=True, how="left", suffixes=("_x", ""))
df_final = df_merged[df_merged.columns.difference(["station_x"])]
(or just rename one of them before you merge)
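A self-contained sketch of the merge on the sample data above (the frames are reconstructed here, so treat the exact values as illustrative):
import numpy as np
import pandas as pd

stations = pd.DataFrame(
    {"station": ["Boston", "New York"], "lat": [45.343, 56.444], "lng": [-45.333, -35.690]},
    index=pd.Index([1, 2], name="id"),
)
df1 = pd.DataFrame({
    "duration": [np.nan, 9, 8],
    "date": [20181118, 20181009, 20170605],
    "station": [np.nan, 2.0, 1.0],
    "gender": ["M", "F", "F"],
})

# Left-join on the station id; the empty suffix keeps the looked-up name as 'station'
df_merged = df1.merge(stations, left_on="station", right_index=True,
                      how="left", suffixes=("_x", ""))
df_final = df_merged[df_merged.columns.difference(["station_x"])]
print(df_final)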
I have a frame moviegoers that includes zip codes but not cities.
I then redefined moviegoers to be zipcodes and changed the data type of zip codes to be a data frame instead of a series.
zipcodes = pd.read_csv('NYC1-moviegoers.csv',dtype={'zip_code': object})
I know the dataset URL I need is this: https://raw.githubusercontent.com/mafudge/datasets/master/zipcodes/free-zipcode-database-Primary.csv.
I defined a dataframe, zip_codes, to pull the data from that dataset and changed it from a series to a dataframe so it's in the same format as the zipcodes dataframe.
I want to merge the dataframes so I can have the moviegoer data. But, instead of zipcodes, I want to have the state abbreviation. This is where I am having issues.
The end goal is to count the number of movie goers per state. Example ideal output:
CA 116
MN 78
NY 60
TX 51
IL 50
Any ideas would be greatly appreciated.
I think you need map by a Series and then use value_counts for the counts:
print (zipcodes)
zip_code
0 85711
1 94043
2 32067
3 43537
4 15213
s = zip_codes.set_index('Zipcode')['State']
df = zipcodes['zip_code'].map(s).value_counts().rename_axis('state').reset_index(name='count')
print (df.head())
state count
0 OH 1
1 CA 1
2 FL 1
3 AZ 1
4 PA 1
Simply merge both datasets on their Zipcode columns, then run groupby for the state counts.
# READ DATA FILES WITH RENAMING OF ZIP COLUMN IN FIRST
url = "https://raw.githubusercontent.com/mafudge/datasets/master/zipcodes/free-zipcode-database-Primary.csv"
moviegoers = pd.read_csv('NYC1-moviegoers.csv', dtype={'zip_code': object}).rename(columns={'zip_code': 'Zipcode'})
zipcodes = pd.read_csv(url, dtype={'Zipcode': object})
# MERGE ON COMMON FIELD
merged_df = pd.merge(moviegoers, zipcodes, on='Zipcode')
# AGGREGATE BY INDICATOR (STATE)
merged_df.groupby('State').size()
# ALTERNATIVE GROUP BY COUNT
merged_df.groupby('State')['Zipcode'].agg('count')
I am trying to learn Python, coming from a SAS background.
I have imported a SAS dataset, and one thing I noticed was that I have multiple date columns that are coming through as SAS dates (I believe).
In looking around, I found a link which explained how to perform this (here):
The code is as follows:
alldata['DateFirstOnsite'] = pd.to_timedelta(alldata.DateFirstOnsite, unit='s') + pd.datetime(1960, 1, 1)
However, I'm wondering how to do this for multiple columns. If I have multiple date fields, rather than repeating this line of code multiple times, can I create a list of fields I have, and then run this code on that list of fields? How is that done?
Thanks in advance
Yes, it's possible to create a list and iterate through it to convert the SAS date fields to pandas datetimes. However, I'm not sure why you're using the to_timedelta method, unless the SAS date fields are represented as seconds after 1960/01/01. If you plan on using the to_timedelta method, then it's simply a case of creating a function that takes your df and your field and passing those two into it:
def convert_SAS_to_datetime(df, field):
    # pd.Timestamp('1960-01-01') is the SAS epoch (the older pd.datetime alias is gone in newer pandas)
    df[field] = pd.to_timedelta(df[field], unit='s') + pd.Timestamp('1960-01-01')
    return df
Now, let's suppose you have your list of fields that you know should be converted to a datetime field (along with your df):
my_list = ['field1','field2','field3','field4','field5']
my_df = pd.read_sas('mySASfile.sas7bdat') # your SAS data that's converted to a pandas DF
You can now iterate through your list with a for loop while passing those fields and your df to the function:
for field in my_list:
    my_df = convert_SAS_to_datetime(my_df, field)
Now, the other method I would recommend is the to_datetime method, but this assumes that you know what the SAS format of your date fields is.
e.g. 01Jan2016 # date9 format
This is when you might have to look through the documentation here to determine the directive for converting the date. In the case of a date9 format, you can use:
df[field] = pd.to_datetime(df[date9field], format="%d%b%Y")
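For instance, a quick sanity check of that directive on the example value above:
import pandas as pd

# '01Jan2016' in SAS date9 format parses with %d%b%Y
print(pd.to_datetime('01Jan2016', format='%d%b%Y'))  # 2016-01-01 00:00:00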
If I read your question correctly, you want to apply your code to multiple columns? To do that, simply do this:
alldata[['col1','col2','col3']] = 'your_code_here'
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [np.NaN, np.NaN, 3, 4, 5, 5, 3, 1, 5, np.NaN],
                   'B': [1, 0, 3, 5, 0, 0, np.NaN, 9, 0, 0],
                   'C': ['Pharmacy of IDAHO', 'Access medicare arkansas', 'NJ Pharmacy', 'Idaho Rx', 'CA Herbals',
                         'Florida Pharma', 'AK RX', 'Ohio Drugs', 'PA Rx', 'USA Pharma'],
                   'D': [123456, 123456, 1234567, 12345678, 12345, 12345, 12345678, 123456789, 1234567, np.NaN],
                   'E': ['Assign', 'Unassign', 'Assign', 'Ugly', 'Appreciate', 'Undo', 'Assign', 'Unicycle', 'Assign', 'Unicorn']})
df[['E', 'D']] = 1 # <---- notice double brackets
print(df)
A B C D E
0 NaN 1.0 Pharmacy of IDAHO 1 1
1 NaN 0.0 Access medicare arkansas 1 1
2 3.0 3.0 NJ Pharmacy 1 1
3 4.0 5.0 Idaho Rx 1 1
4 5.0 0.0 CA Herbals 1 1
5 5.0 0.0 Florida Pharma 1 1
6 3.0 NaN AK RX 1 1
7 1.0 9.0 Ohio Drugs 1 1
8 5.0 0.0 PA Rx 1 1
9 NaN 0.0 USA Pharma 1 1
Notice the double brackets in the beginning. Hope this helps!
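Tying this back to the SAS date question: a sketch of converting several date columns in one step using that double-bracket selection (the frame and the second column name are made up, only DateFirstOnsite comes from the question):
import pandas as pd

# Hypothetical frame whose SAS datetime columns hold seconds since 1960-01-01
alldata = pd.DataFrame({
    'DateFirstOnsite': [1.80e9, 1.90e9],   # made-up second counts
    'DateSecondOnsite': [1.85e9, 1.95e9],  # hypothetical second column
})

date_cols = ['DateFirstOnsite', 'DateSecondOnsite']
alldata[date_cols] = alldata[date_cols].apply(
    lambda col: pd.to_timedelta(col, unit='s') + pd.Timestamp('1960-01-01')
)
print(alldata)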