I have two CSV files, shown below, that I'd like to merge - using the first column ID_ as the unique identifier and appending the AMT column as a new column in the final file.
CSV1
ID_ CUSTOMER_ID_ EMAIL_ADDRESS_
1090 1 example1#example.com
1106 2 example2#example.com
1145 3 example3#example.com
1206 4 example4#example.com
1247 5 example5#example.com
1254 6 example6#example.com
1260 7 example7#example.com
1361 8 example8#example.com
1376 9 example9#example.com
CSV2
ID_ AMT
1090 5
1106 5
1145 5
1206 5
1247 5
1254 65
1260 5
1361 10
1376 5
Here's what I'm looking for in a final file:
ID_ CUSTOMER_ID_ EMAIL_ADDRESS_ AMT
1090 1 example1#example.com 5
1106 2 example2#example.com 5
1145 3 example3#example.com 5
1206 4 example4#example.com 5
1247 5 example5#example.com 5
1254 6 example6#example.com 65
1260 7 example7#example.com 5
1361 8 example8#example.com 10
1376 9 example9#example.com 5
I've tried modifying the command below as much as possible, but I'm not able to get what I'm looking for. Really stuck on this - not sure what else I can do. Really appreciate any and all help!
join -t, File1.csv File2.csv
The data shown in this example contains tabs, but my actual files are CSVs, as mentioned, and use commas as the separator.
This can be done easily using the pandas library. Here is my code to do this:
'''
This program reads two csv files and merges them based on a common key column.
'''
# import the pandas library
# you can install using the following command: pip install pandas
import pandas as pd
# Read the files into two dataframes.
df1 = pd.read_csv('CSV1.csv')
df2 = pd.read_csv('CSV2.csv')
# Merge the two dataframes, using the ID_ column as the key
df3 = pd.merge(df1, df2, on = 'ID_')
df3.set_index('ID_', inplace = True)
# Write it to a new CSV file
df3.to_csv('CSV3.csv')
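If some IDs appear in only one of the files, the how argument of merge controls what happens to the unmatched rows; a small sketch (the default, how='inner', keeps only IDs present in both files):
# Keep every row from CSV1; AMT will be NaN where CSV2 has no match.
df3 = pd.merge(df1, df2, on='ID_', how='left')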
You can find a short tutorial on pandas here:
https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
I have a dataframe with two columns
df
hr count
2 53
3 1586
4 890
5 833
6 209
I want to take the average/percentile of each value in the column and store the result in a new column. Currently I am doing it this way.
df['avg'] = (df['count'] / df['count'].sum()) * 100
df
hr count avg
2 53 1.484178
3 1586 44.413330
4 890 24.922991
5 833 23.326799
6 209 5.852702
I want to do this with a built-in function like mean(). How can I achieve this?
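As a side note, a sketch with chained built-in Series methods: div and mul reproduce the manual formula exactly, and rank(pct=True) gives a percentile rank if that is what's wanted instead.
df['avg'] = df['count'].div(df['count'].sum()).mul(100)
df['pct_rank'] = df['count'].rank(pct=True) * 100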
I am reading a data frame in from a CSV file and I am trying to create a time graph of when the tickets were issued, by the frequency of tickets issued. The column containing the times is in a format of hours with a letter indicating am or pm, i.e. 1200A. Because of this, when I try sorting the data frame in ascending order, only the numerical value is considered and the A/P is disregarded. How can I sort the index of my data frame to consider the A and P?
I have tried using the sort_index function, and this works, but only for sorting the numbers.
from matplotlib import pyplot as plt
import pandas as pd
tickets = pd.read_csv("./Parking_Violations_Issued_-_Fiscal_Year_2019.csv")
d2=tickets['Violation Time'].value_counts()
df2=d2.sort_index(ascending=True, sort_remaining=True)
Sample dataset:
Index Violation Time
.847A 1
0000A 801
0000P 22
0001A 545
0001P 1
0002A 499
0003A 520
0004A 498
0004P 1
0005A 619
0006A 983
0007A 993
0008A 1034
0008P 1
0009A 1074
Original CSV link
This will do the job.
Explanation:
First, I converted your time column into tuples, like [('.847', 'A'), ('0000', 'A'), ('0001', 'A'), ...].
Next, I sorted according to your logic, i.e., by the second element of each tuple ('A'/'P') and then by the first element (the numbers), and joined the tuples back into their original form.
Finally, I merged with the original dataset to get the required output.
Code:
>>> tickets # Assuming your initial dataframe looks like below, as mentioned in OP
Index Violation Time
0 .847A 1
1 0000A 801
2 0000P 22
3 0001A 545
4 0001P 1
5 0002A 499
6 0003A 520
7 0004A 498
8 0004P 1
9 0005A 619
10 0006A 983
11 0007A 993
12 0008A 1034
13 0008P 1
>>> final_df = pd.DataFrame(["".join(i) for i in sorted(tickets.apply(lambda x: (x['Index'][:-1], x['Index'][-1]), axis=1), key=lambda x : (x[1], x[0]))])
>>> final_df.rename(columns={0:'Index'}, inplace=True)
Output:
>>> final_df.merge(tickets)
Index Violation Time
0 .847A 1
1 0000A 801
2 0001A 545
3 0002A 499
4 0003A 520
5 0004A 498
6 0005A 619
7 0006A 983
8 0007A 993
9 0008A 1034
10 0009A 1074
11 0000P 22
12 0001P 1
13 0004P 1
14 0008P 1
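As an aside, on pandas 1.1+ the same ordering can be had in one step via the key argument of sort_index; a sketch, assuming the A/P suffix is always the last character:
# Move the suffix to the front so plain string order puts all 'A' times before 'P' times,
# each group ordered by its digits.
df2 = d2.sort_index(key=lambda idx: idx.str[-1] + idx.str[:-1])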
I would consider writing an algorithm to parse the time strings into the sorting order you would like.
If indeed every Violation Time has an A or P as the last character, you could create a new sorting column which parses the time string into a datetime object. Depending on how dirty the data is, you will have to add some additional parsing checks for the hour and minute substrings, but here is a good start:
EDIT: I added in checks for length and string type to ensure the string is parseable before parsing.
from datetime import datetime
import pandas as pd

def parseDateTime(x, tformat='%I%M%p'):
    # Nulls can't be parsed.
    if pd.isnull(x):
        return None
    # Only attempt to parse 5-character strings like '0800A'.
    if type(x) is str and len(x) == 5:
        if x[0:2].isdigit() and x[2:4].isdigit():
            # '0800A' -> '0800AM', which strptime understands.
            newString = str(x).strip() + 'M'
            try:
                return datetime.strptime(newString, tformat)
            except ValueError:
                # e.g. '0000A': hour 00 is invalid on the 12-hour clock (%I).
                return None
    return None
Note that without date information, the times will all be treated as being on the same day.
Now, you can apply this function to the column and then use the new parsed column for your sorting purposes.
tickets['Violation Time Parsed'] = tickets['Violation Time'].apply(parseDateTime)
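For example, sorting on the parsed column then gives the chronological order (a sketch using the sample column names):
tickets = tickets.sort_values('Violation Time Parsed')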
The Scenario
My dataset was in the following format, which I refer to as the ACTUAL FORMAT:
uid iid rat tmp
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
and when passing it to another function (KMeans clustering), it needs to be in the format below, which I've created using a pivot mapping and refer to as the MATRIX FORMAT:
uid 1 2 3 4
4 4.3320762062 4.3407749532 4.3111995162 4.3411425423
5 4 3 2.1952622349 3.1913491995
6 4 3.4233243638 3.8255108621 3.948791424
7 4.4983411706 4.0477240538 4.0241460801 5
8 4.1773004578 4.0191412859 4.0442369862 4.1754642909
9 4.2733984521 4.2797130861 4.2682723131 4.2816986988
15 1 3.0554789259 3.2279546684 3.1282278957
16 5 4.3473697565 4.0675394438 5
The Problem:
Now, since I need the resulting MATRIX FORMAT data to be passed back to the first algorithm, I need to convert it to the ACTUAL FORMAT.
Conversion:
For conversion from ACTUAL to MATRIX format I did:
Pivot_Matrix = source_data.pivot(values='rat', index='uid', columns='iid')
I tried reversing and interchanging values to get back the ACTUAL FORMAT, but that failed. Is there any way to convert the MATRIX FORMAT back to the ACTUAL FORMAT?
You need stack, with rename_axis for the column names and, lastly, reset_index:
df = df.stack().rename_axis(('uid','iid')).reset_index(name='rat')
print (df.head())
uid iid rat
0 4 1 4.332076
1 4 2 4.340775
2 4 3 4.311200
3 4 4 4.341143
4 5 1 4.000000
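Note that stack drops NaN cells by default, so any missing ratings vanish in the round trip; to keep them, pass dropna=False (a sketch):
df = df.stack(dropna=False).rename_axis(('uid','iid')).reset_index(name='rat')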
I'm using the MovieLens dataset in Python pandas. I need to print the matrix of u.data, a tab-separated file, in the following way:
NULL MovieID1 MovieID2 MovieID3
UserID1 Rating Rating Rating
UserID2 Rating Rating Rating
I've already been through the following links:
One - the dataset is too huge to put in a series
Two - transpose of rows not mentioned
Three - tried reindex so as to get NaN values in one column
Four - df.iloc and df.ix didn't work either
I need the output to show me the rating, and NaN when not rated, for movies w.r.t. users.
NULL MovieID1 MovieID2 MovieID3
UserID1 Rating Rating NaN
UserID2 Rating NaN Rating
P.S. I won't mind having solutions with numpy, crab, recsys, csv or any other python package
EDIT 1 - Sorted the data and exported but got an additional field
df2 = df.sort_values(['UserID','MovieID'])
print(type(df2))
df2.to_csv("sorted.csv")
print(df2)
The code produces the following sorted.csv file:
,UserID,MovieID,Rating,TimeStamp
32236,1,1,5,874965758
23171,1,2,3,876893171
83307,1,3,4,878542960
62631,1,4,3,876893119
47638,1,5,3,889751712
5533,1,6,5,887431973
70539,1,7,4,875071561
31650,1,8,1,875072484
20175,1,9,5,878543541
13542,1,10,3,875693118
EDIT 2 - As asked in the comments, here's the format of the data in the u.data file, which acts as input:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
One method:
Use pivot_table; if there is one value per user and movie ID, then aggfunc doesn't matter, but if there are multiple values, choose your aggregation.
df.pivot_table(values='Rating',index='UserID',columns='MovieID', aggfunc='mean')
Second method (no duplicate userid, movieid records):
df.set_index(['UserID','MovieID'])['Rating'].unstack()
Third method (no duplicate userid, movieid records):
df.pivot(index='UserID',columns='MovieID',values='Rating')
Fourth method (like the first, you can choose your aggregation method):
df.groupby(['UserID','MovieID'])['Rating'].mean().unstack()
Output:
MovieID 1 2 3 4 5 6 7 8 9 10
UserID
1 5 3 4 3 3 5 4 1 5 3
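All four methods assume df already has named columns; since u.data has no header row, a loading sketch (column names assumed) would be:
import pandas as pd
# u.data is tab-separated with no header; assign column names explicitly.
df = pd.read_csv('u.data', sep='\t', names=['UserID', 'MovieID', 'Rating', 'TimeStamp'])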
Suppose I have the following rows in a Pandas DataFrame:
970 P-A1-1019-03-C15,15 23987896 1 8
971 P-A1-1019-06-B15,15 23251711 4 8
972 P-A1-1019-08-C15,15 12160034 2 8
973 P-A1-1020-01-D15,15 8760012 1 8
I'd like to alter the second column to remove the ",15" from the string. Desired end state would be like this:
970 P-A1-1019-03-C15 23987896 1 8
971 P-A1-1019-06-B15 23251711 4 8
972 P-A1-1019-08-C15 12160034 2 8
973 P-A1-1020-01-D15 8760012 1 8
The thing to remove won't always be ",15", as it could be ",10", ",03", ",4", etc. Additionally, some rows in the input are differently formatted, and may look like this:
4 RR00-0,2020338 24380076 4 12
5 RR00-0,2020738 10562767 2 12
6 ,D 24260808 1 12
7 ,D 23521158 1 12
Initially, I'm only interested in the cases where the string DOES fit the form of "P-A1-1019-03-C15", so it would be nice to be able to drop rows which don't match that specific format.
Is there a built in way to do this kind of processing, or will I need to iterate over every row manually?
This should remove all ',15' values:
dataframe['string column'] = dataframe['string column'].apply(lambda value: value.split(',')[0] if value.split(',')[-1] == '15' else value)
This should remove all ',15' values if they are in the format you provided:
dataframe['string column'] = dataframe['string column'].apply(lambda value: value.split(',')[0] if value.split(',')[-1] == '15' and 'P-A1-' in value else value)
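Since the suffix won't always be ',15', here is a regex-based sketch (the column name and the exact pattern are assumed from your samples) that keeps only rows matching the P-A1-... format and strips any trailing ',<something>':
# Keep only rows that look like 'P-A1-1019-03-C15,15'.
mask = dataframe['string column'].str.match(r'^P-A1-\d{4}-\d{2}-[A-Z]\d{2}', na=False)
dataframe = dataframe[mask]
# Remove everything from the comma onwards.
dataframe['string column'] = dataframe['string column'].str.replace(r',.*$', '', regex=True)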