I would like to split a column in two using a given delimiter. The Excel file does not have a header row. Example input and output are shown in the screenshots below:
Input:
Output:
You can follow the steps below and modify them to suit your needs:
My original dataframe looks like this:
df1.head()
District Population Ratio
0 Hampshire 25000 1.56
1 Dorset 500000 7.34
2 Wiltshire 735298 3.67
3 Worcestershire 12653 8.23
Since Ratio is a numeric column, you first need to convert it to string:
df1['Ratio'] = df1['Ratio'].astype(str)
Once the column is a string, you can split it as needed:
df1['Decimal'] = df1['Ratio'].str.split('.').str[1]
df1['Ratio'] = df1['Ratio'].str.split('.').str[0]
The final dataframe looks like this, with the values separated on the delimiter:
df1.head()
District Population Ratio Decimal
0 Hampshire 25000 1 56
1 Dorset 500000 7 34
2 Wiltshire 735298 3 67
3 Worcestershire 12653 8 23
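As a variation (a sketch, not part of the original answer), the conversion and both split calls can be collapsed into one step with expand=True, which returns the two halves as separate columns:

```python
import pandas as pd

df1 = pd.DataFrame({'District': ['Hampshire', 'Dorset'],
                    'Population': [25000, 500000],
                    'Ratio': [1.56, 7.34]})

# convert to string and split once; expand=True yields a two-column frame
df1[['Ratio', 'Decimal']] = df1['Ratio'].astype(str).str.split('.', expand=True)
```

Note that both resulting columns are strings, so cast them back with astype(int) if you need numbers.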
I have a sample dataframe that looks like this:
primaryName averageRating primaryProfession knownForTitles runtimeMinutes
1 Fred Astaire 7.0 soundtrack,actor,miscellaneous tt0072308 165
2 Fred Astaire 6.9 soundtrack,actor,miscellaneous tt0031983 93
3 Fred Astaire 7.0 soundtrack,actor,miscellaneous tt0050419 103
4 Fred Astaire 7.1 soundtrack,actor,miscellaneous tt0053137 134
So basically I want to group by the primaryName column and take the average of the averageRating column, extract "actor/actress" from the primaryProfession column, the count of the knownForTitles column, and the sum of the runtimeMinutes column.
The output dataframe should look like this:
primaryName averageRating primaryProfession knownForTitles runtimeMinutes
1 Fred Astaire 28 actor 4 495
Any ideas how I can achieve this? Thanks in advance for the help.
Try this:
df.loc[df['primaryProfession'].str.contains('actor'), 'primaryProfession'] = 'actor'
df.loc[df['primaryProfession'].str.contains('actress'), 'primaryProfession'] = 'actress'
df.groupby(['primaryName', 'primaryProfession'], as_index=False).agg({'averageRating':'mean', 'knownForTitles':'count', 'runtimeMinutes':'sum'})
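Put together as a runnable sketch (the sample frame below is reconstructed from the question, so treat the exact values as illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'primaryName': ['Fred Astaire'] * 4,
    'averageRating': [7.0, 6.9, 7.0, 7.1],
    'primaryProfession': ['soundtrack,actor,miscellaneous'] * 4,
    'knownForTitles': ['tt0072308', 'tt0031983', 'tt0050419', 'tt0053137'],
    'runtimeMinutes': [165, 93, 103, 134],
})

# collapse the profession list to a single label before grouping
df.loc[df['primaryProfession'].str.contains('actor'), 'primaryProfession'] = 'actor'
df.loc[df['primaryProfession'].str.contains('actress'), 'primaryProfession'] = 'actress'

out = df.groupby(['primaryName', 'primaryProfession'], as_index=False).agg(
    {'averageRating': 'mean', 'knownForTitles': 'count', 'runtimeMinutes': 'sum'})
```

On this sample the single output row has a mean rating of 7.0, 4 titles, and 495 total minutes.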
Hi, I'm cleaning up a big dataset about food products and I'm struggling with one column of object dtype (df['serving_size'].dtype == 'O') that holds the size of the product. It's a pandas dataframe that contains 300,000 observations. I managed to strip out the text that accompanied the size with the help of regex:
df['serving_size'] = df['serving_size'].str.replace(r'[^\d\,\.]', ' ', regex=True)
df['serving_size'] = df['serving_size'].str.replace(r'(^\d\,\.+)\s', '', regex=True)
And I got this (the blanks are whitespace):
40.5 23
13
87 23
123
72,5
And my goal would be to keep only the first group of numbers in each row, including the , and ., like so:
40.5
13
87
123
72.5
Despite my research I didn't find how to achieve it. Thanks
You can use .str.extract() with a regex, followed by a comma-to-dot replacement, as follows:
df['serving_size'] = (df['serving_size'].str.extract(r'(\d+(?:,\d+)*(?:\.\d+)?)', expand=False)
                                        .str.replace(',', '.', regex=False))
Result:
print(df)
serving_size
0 40.5
1 13
2 87
3 123
4 72.5
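If the goal is to go on and treat the column as numeric, the whole pipeline can be written in one chain, normalising any decimal comma (e.g. 72,5) and casting to float. A sketch using the sample strings from above:

```python
import pandas as pd

s = pd.Series(['40.5  23', '13', '87  23', '123', '72,5'])

# keep the first number, turn a decimal comma into a dot, then cast to float
out = (s.str.extract(r'(\d+(?:[.,]\d+)?)', expand=False)
         .str.replace(',', '.', regex=False)
         .astype(float))
```

expand=False makes extract return a Series rather than a one-column DataFrame, which is what lets the .str.replace and .astype calls chain on.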
I have a dataframe with multiple columns
df = pd.DataFrame({"cylinders":[2,2,1,1],
"horsepower":[120,100,89,70],
"weight":[5400,6200,7200,1200]})
cylinders horsepower weight
0 2 120 5400
1 2 100 6200
2 1 89 7200
3 1 70 1200
I would like to create a new dataframe with two subcolumns of weight, the median and the mean, while grouping by cylinders.
example:
weight
cylinders horsepower median mean
0 1 100 5299 5000
1 1 120 5100 5200
2 2 70 7200 6500
3 2 80 1200 1000
For my example tables I have used random values. I can't manage to achieve that.
I know how to get the median and mean; it's described in this Stack Overflow question:
df.weight.median()
df.weight.mean()
df.groupby('cylinders') #groupby cylinders
But how to create this subcolumn?
The following code fragment adds the two requested columns. It groups the rows by cylinders, calculates the mean and median of weight, and joins the result back onto the original dataframe via the cylinders column:
result = df.join(df.groupby('cylinders')['weight']
                   .agg(['mean', 'median']),
                 on='cylinders') \
           .sort_values(['cylinders', 'mean'])
#   cylinders  horsepower  weight    mean  median
#2          1          89    7200  4200.0  4200.0
#3          1          70    1200  4200.0  4200.0
#1          2         100    6200  5800.0  5800.0
#0          2         120    5400  5800.0  5800.0
You cannot have "subcolumns" for just some columns in pandas: if one column has subcolumns, every other column must have them too. This is called multi-level (MultiIndex) columns.
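That said, the requested layout can be produced on the aggregated frame by promoting its columns to a two-level MultiIndex. A minimal sketch, not the only way to do it:

```python
import pandas as pd

df = pd.DataFrame({"cylinders": [2, 2, 1, 1],
                   "horsepower": [120, 100, 89, 70],
                   "weight": [5400, 6200, 7200, 1200]})

agg = df.groupby('cylinders')['weight'].agg(['median', 'mean'])
# nest median/mean as subcolumns under a 'weight' top level
agg.columns = pd.MultiIndex.from_product([['weight'], agg.columns])
```

Individual subcolumns are then addressed with tuples, e.g. agg[('weight', 'mean')].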
I'm using the MovieLens dataset in Python pandas. I need to print the matrix of u.data, a tab-separated file, in the following way:
NULL MovieID1 MovieID2 MovieID3
UserID1 Rating Rating Rating
UserID2 Rating Rating Rating
I've already been through the following links:
One - the dataset is too huge, so it was put into a Series
Two - transpose of a row not mentioned
Three - tried reindex so as to get NaN values in one column
Four - df.iloc and df.ix didn't work either
I need the output to show the rating, and NaN when not rated, for movies w.r.t. users.
NULL MovieID1 MovieID2 MovieID3
UserID1 Rating Rating NaN
UserID2 Rating NaN Rating
P.S. I won't mind having solutions with numpy, crab, recsys, csv or any other python package
EDIT 1 - Sorted the data and exported it, but got an additional field
df2 = df.sort_values(['UserID', 'MovieID'])
print(type(df2))
df2.to_csv("sorted.csv")
print(df2)
This produces the following sorted.csv file:
,UserID,MovieID,Rating,TimeStamp
32236,1,1,5,874965758
23171,1,2,3,876893171
83307,1,3,4,878542960
62631,1,4,3,876893119
47638,1,5,3,889751712
5533,1,6,5,887431973
70539,1,7,4,875071561
31650,1,8,1,875072484
20175,1,9,5,878543541
13542,1,10,3,875693118
EDIT 2 - As asked in the comments, here's the format of the data in the u.data file, which acts as input:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
One method: use pivot_table. If there is only one value per user and movie id, the aggfunc doesn't matter; if there are multiple values, then choose your aggregation.
df.pivot_table(values='Rating',index='UserID',columns='MovieID', aggfunc='mean')
Second method (no duplicate userid, movieid records):
df.set_index(['UserID','MovieID'])['Rating'].unstack()
Third method (no duplicate userid, movieid records):
df.pivot(index='UserID',columns='MovieID',values='Rating')
Fourth method (like the first you can choose your aggregation method):
df.groupby(['UserID','MovieID'])['Rating'].mean().unstack()
Output:
MovieID 1 2 3 4 5 6 7 8 9 10
UserID
1 5 3 4 3 3 5 4 1 5 3
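End to end, reading u.data and pivoting might look like the sketch below. The column names are assumptions based on the question, and a StringIO stands in for the real file:

```python
import io
import pandas as pd

# stand-in for u.data: UserID, MovieID, Rating, TimeStamp, tab-separated
raw = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n196\t302\t4\t881250950\n"

df = pd.read_csv(io.StringIO(raw), sep='\t',
                 names=['UserID', 'MovieID', 'Rating', 'TimeStamp'])

# rows become users, columns become movies, missing ratings become NaN
matrix = df.pivot(index='UserID', columns='MovieID', values='Rating')
```

For the real file, replace io.StringIO(raw) with the path to u.data.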
I am working with a very large dataframe (3.5 million X 150 and takes 25 gigs of memory when unpickled) and I need to find maximum of one column over an id number and a date and keep only the row with the maximum value. Each row is a recorded observation for one id at a certain date and I also need the latest date.
This is animal test data where there are twenty additional columns seg1-seg20 for each id and date that are filled with test day information consecutively; for example, the first test's data fills seg1, the second test's data fills seg2, etc. The "value" field indicates how many segments have been filled, in other words how many tests have been done, so the row with the maximum "value" has the most test data. Ideally I only want these rows and not the previous rows. For example:
df= DataFrame({'id':[1000,1000,1001,2000,2000,2000],
"date":[20010101,20010201,20010115,20010203,20010223,20010220],
"value":[3,1,4,2,6,6],
"seg1":[22,76,23,45,12,53],
"seg2":[23,"",34,52,24,45],
"seg3":[90,"",32,"",34,54],
"seg4":["","",32,"",43,12],
"seg5":["","","","",43,21],
"seg6":["","","","",43,24]})
df
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
1 20010201 1000 76 1
2 20010115 1001 23 34 32 32 4
3 20010203 2000 45 52 2
4 20010223 2000 12 24 34 43 43 43 6
5 20010220 2000 53 45 54 12 21 24 6
And eventually it should be:
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
2 20010115 1001 23 34 32 32 4
4 20010223 2000 12 24 34 43 43 43 6
I first tried to use .groupby('id').max() but couldn't find a way to use it to drop rows. The resulting dataframe MUST contain the ORIGINAL ROWS, not just the maximum value of each column for each id. My current solution is:
for i in df.id.unique():
    df = df.drop(df.loc[df.id == i].sort_values(['value', 'date']).index[:-1])
But this takes around 10 seconds each time through, I assume because it scans the entire dataframe on every iteration. There are 760,000 unique ids, each 17 digits long, so it would take far too long to be feasible at this rate.
Is there another method that would be more efficient? Currently it reads every column in as an "object", but converting the relevant columns to the smallest possible integer type doesn't seem to help either.
I tried groupby('id').max() and it works; it also drops the rows. Did you remember to reassign the df variable? This operation (like almost all pandas operations) is not in-place.
If you do:
df.groupby('id', sort = False).max()
You will get:
date value
id
1000 20010201 3
1001 20010115 4
2000 20010223 6
And if you don't want id as the index, you do:
df.groupby('id', sort = False, as_index = False).max()
And you will get:
id date value
0 1000 20010201 3
1 1001 20010115 4
2 2000 20010223 6
I don't know if that's going to be much faster, though.
Update
This way the index will not be reset:
df.loc[df.groupby('id').apply(lambda x: x['value'].idxmax())]
And you will get:
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
2 20010115 1001 23 34 32 32 4
4 20010223 2000 12 24 34 43 43 43 6
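A vectorised variant that also honours the "latest date on ties" requirement is to sort by date descending first, so that idxmax picks the most recent row among equal maxima. A sketch on a trimmed version of the question's sample (seg columns omitted for brevity):

```python
import pandas as pd

df = pd.DataFrame({'id': [1000, 1000, 1001, 2000, 2000, 2000],
                   'date': [20010101, 20010201, 20010115,
                            20010203, 20010223, 20010220],
                   'value': [3, 1, 4, 2, 6, 6]})

# with the newest date first, idxmax returns the latest-dated max per id
best = df.sort_values('date', ascending=False).groupby('id')['value'].idxmax()
result = df.loc[best].sort_index()
```

This avoids the per-id Python loop entirely, which should matter with 760,000 unique ids.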