I have a dataset:
,target,text
0,0,awww thats bummer shoulda got david carr third day
1,0,upset cant update facebook texting might cry result school today also blah
2,0,dived many times ball managed save 50 rest go bounds
3,0,whole body feels itchy like fire
4,0,behaving im mad cant see
5,0,whole crew
6,0,need hug
I wanted to separate my csv and bring all data which has target = 0 into another .csv:
data_neg = df['target'] == '0'
df_neg = df[data_neg]
df_neg.to_csv("negative.csv")
And after doing this, the unnamed column in negative.csv is duplicated:
,Unnamed: 0,target,text
0,0,0,awww thats bummer shoulda got david carr third day
1,1,0,upset cant update facebook texting might cry result school today also blah
2,2,0,dived many times ball managed save 50 rest go bounds
3,3,0,whole body feels itchy like fire
4,4,0,behaving im mad cant see
5,5,0,whole crew
Why does this happen, and how can I avoid the duplication? It only happens with the first column, the one with the id.
The unnamed column is the DataFrame's index, which to_csv writes out by default. Create a copy, and specify which column is your index when reading the CSV file back:
# ...
df_neg = df[data_neg].copy()
df_neg.to_csv("negative.csv")
# For reading it
df_neg = pd.read_csv("negative.csv", index_col=0)
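Alternatively, if you don't need to preserve the index at all, you can skip writing it in the first place; to_csv takes an index flag:

df_neg.to_csv("negative.csv", index=False)

Then there is no unnamed column to worry about when reading the file back.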
First of all, I'm sorry if this question has already been asked but I believe my challenge is specific enough. I'm not looking for complete answers but simply guidelines on how I can proceed.
I have a raw dataset of monitoring participants. This data includes things like income, savings, etc., and these participants have been tracked for 6 months (Jan to Jun). But the data is stored in a single Excel file with a column specifying the month, which means that each participant's name appears 6 times in the file, once per month. Each participant has a unique ID.
I want to transform this data into a more workable shape, and I wanted to learn to do it with Python. But I feel stuck and rusty, because it's been ages since I've coded and I'm only used to the code I write on a regular basis (printing grouped averages, etc.). Here are the steps I want to follow:
a. Start by creating a column which contains a unique list of the participants that have been tracked, using the ID. Each participant has to be cited only once;
b. Each participant is recorded with an activity and sub-activity type in the original file, which will need to be added to the new dataset as well;
c. For the month of January, for example, I want to create a 'january_income' column into which the January income is pulled from the raw dataset, and so on for each variable and each month.
Can anyone provide guidelines on how I may proceed? As I said, it doesn't have to be specific code; it can be methods or steps, along with the functions I can use.
Thanks a lot already.
N.B: I use Spyder as a working environment.
Your question is not very specific, but you can try and adjust the code below:
import csv

"""
Convert your excel file to csv format.
This sample assumes that you have a csv file with the first row as header or fieldnames.
"""
with open('test.csv', 'w') as fp:
    fp.write("""ID,Name,Income,Savings,Month
1,"Sample Name",1000,100,1
""")

def format(infile='infile.csv', outfile='outfile.csv'):
    months = ['January', 'February', 'March']  # Add specific months
    target_fields = ['Income', 'Savings']  # Add your desired fields
    timestamp_field = 'Month'  # The field which indicates the month of the row
    ID_field = 'ID'  # The field which indicates the unique identifier of the participant
    part_specific_fields = [ID_field, 'Name']  # The fields which are specific to each participant; these won't be touched at all
    target_combined_fields = [f'{month}_{field}' for field in target_fields for month in months]
    total_fields = part_specific_fields + target_combined_fields
    temp = {}
    with open(infile, 'r', newline='') as fpi, open(outfile, 'w', newline='') as fpo:
        reader = csv.DictReader(fpi)
        for row in reader:
            ID = int(row[ID_field])
            if ID not in temp:
                temp[ID] = {}
                for other_field in part_specific_fields:
                    # Insert the constant columns that should not be touched
                    temp[ID][other_field] = row[other_field]
            month_pos = int(row[timestamp_field]) - 1  # subtract 1 for 0-indexing
            month = months[month_pos]  # Month name in plain English
            for field in target_fields:
                temp[ID][f'{month}_{field}'] = row[field]
        # All the processing completed, now write the data
        writer = csv.DictWriter(fpo, fieldnames=total_fields)
        writer.writeheader()
        for row in temp.values():
            writer.writerow(row)
    # File has been written successfully, now return the mapped dictionary
    return temp

print(format('test.csv'))
First, you have to convert your .xls file to .csv format.
Process each row and map it to the specific <month>_<field> keys.
Write the processed data to the outfile.csv file.
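If you'd rather script the Excel-to-CSV conversion as well, pandas can do it in one line (a sketch; the file names are placeholders):

import pandas as pd
pd.read_excel('participants.xlsx').to_csv('infile.csv', index=False)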
Thanks for the notes. First of all, I'm sorry if my post is not specific, and thanks for initiating me into the community. Since my initial post, I've made some effort to work on my data, and with my actual knowledge of the language, all I could come up with was the filtering code shown below. This lets me have a column for each variable of each month, but I'm stuck on two things. First, I had to repeat this code for each month and change the month in the labels. I wouldn't have minded that approach if I didn't face another problem: it doesn't take into account the fact that some participants were not tracked in certain months, which means that even if the data is sorted by ID, there is a mismatch between the columns, because their lengths vary with the number of participants tracked that month. Now I'm looking to optimize this code by adding something that would resolve the second issue (at this point I don't mind if the code is long, but if there's optimization to be made, I'm open to it as well):
os.chdir("XXXXXXX")
economique = pd.read_csv('data_economique.csv')
#JANVIER
ID_jan = economique.query("mois_de_suivi == 'Janvier'")["ID"]
nom_jan = economique.query("mois_de_suivi == 'Janvier'")["nom"]
sexe_jan = economique.query("mois_de_suivi == 'Janvier'")["sexe"]
district_jan = economique.query("mois_de_suivi == 'Janvier'")["district"]
activite_jan = economique.query("mois_de_suivi == 'Janvier'")["activite"]
CA_jan = economique.query("mois_de_suivi == 'Janvier'")["chiffre_affaire"]
charges_jan = economique.query("mois_de_suivi == 'Janvier'")["charges"]
resultat_jan = economique.query("mois_de_suivi == 'Janvier'")["benefice"]
remb_attendu_jan = economique.query("mois_de_suivi == 'Janvier'")["remb_attendu"]
remb_effectue_jan = economique.query("mois_de_suivi == 'Janvier'")["remb_effectue"]
remb_differe_jan = economique.query("mois_de_suivi == 'Janvier'")["calcul_remb_differe"]
epargne_jan = economique.query("mois_de_suivi == 'Janvier'")["calcul_epargne"]
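For what it's worth, pandas can do this whole reshape in one step with pivot, which also fills in NaN for the months a participant was not tracked, sidestepping the column-length mismatch. A minimal sketch, assuming the column names from the snippet above:

import pandas as pd

economique = pd.read_csv('data_economique.csv')

# One row per ID, one column per (variable, month) pair;
# missing participant/month combinations become NaN automatically.
wide = economique.pivot(index='ID', columns='mois_de_suivi',
                        values=['chiffre_affaire', 'charges', 'benefice', 'calcul_epargne'])

# Flatten the resulting MultiIndex columns into names like 'chiffre_affaire_Janvier'.
wide.columns = [f'{var}_{mois}' for var, mois in wide.columns]
wide = wide.reset_index()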
I have a question regarding text file handling. My text file prints as one column. The column has data scattered throughout the rows and visually looks great and somewhat uniform; however, it's still just one column. Ultimately, I'd like to append the row where the keyword is found to the end of the previous row, until the data is one long row. Then I'll use str.split() to cut sections up into columns as I need.
In Excel (first code block below), I took this same text file, removed headers, aligned left, and searched for keywords. When found, Excel has a nice feature called Offset, where you can place or append the cell value basically anywhere using offset(x, y).Value from the active-cell start position. Once done, I would delete the row. This allowed me to get the data into a tabular column format that I could work with.
What I Need:
The Python code below will cycle down through each row looking for the keyword 'Address:'. This part of the code works. Once it finds the keyword, the next line should append that row to the end of the previous row. This is where my problem is: I cannot find a way to get the active row number into a variable so I can use it in place of the word [index] for the active row, or [index-1] for the previous row.
Excel code for a similar task
Do
    Set Rng = WorkRng.Find("Address", LookIn:=xlValues)
    If Not Rng Is Nothing Then
        Rng.Offset(-1, 2).Value = Rng.Value
        Rng.Value = ""
    End If
Loop While Not Rng Is Nothing
Python Equivalent
import pandas as pd
from pandas import DataFrame, Series
file = {'Test': ['Last Name: Nobody', 'First Name: Tommy', 'Address: 1234 West Juniper St.',
                 'Fav Toy', 'Notes', 'Time Slot']}
df = pd.DataFrame(file)
Test
0 Last Name: Nobody
1 First Name: Tommy
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I've tried the following:
for line in df.Test:
    if line.startswith('Address:'):
        # The line below does not work with the index statement
        df.loc[[index-1], :].values = df.loc[index-1].values + ' ' + df.loc[index].values
    else:
        pass

# df.loc[[1],:] = df.loc[1].values + ' ' + df.loc[2].values  # copies row 2 onto the end of row 1;
#                                                            # works with static row numbers only
# df.drop([2,0], inplace=True)  # Deletes row from df
Expected output:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I am trying to wrap my head around the whole series-vectorization approach, but I'm still stuck trying loops that I'm semi-familiar with. If there is a way to achieve this, please point me in the right direction.
As always, I appreciate your time and your knowledge. Please let me know if you can help with this issue.
Thank You,
Use Series.shift on Test, then use Series.str.startswith to create a boolean mask, then use boolean indexing with this mask to update the values in the Test column:
s = df['Test'].shift(-1)
m = s.str.startswith('Address', na=False)
df.loc[m, 'Test'] += (' ' + s[m])
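Here shift(-1) aligns each row with the value of the row below it, so the mask m is True exactly on the rows that immediately precede an 'Address' line; those are the rows that receive the appended text.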
Result:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I have a dataframe whose column values are lists of dictionaries; it looks like this:
id comments
1 [{review: {review_id: 8987, review_text: 'wonderful'}}, {review: {review_id: 8988, review_text: 'good'}}]
2 [{review: {review_id: 9098, review_text: 'not good'}}, {review: {review_id: 9895, review_text: 'terrible'}}]
I figured out how to flatten that specific comments cell by doing:
pd.io.json.json_normalize(json.loads(df['comments'].iloc[0].replace("'", '"')))
It makes a new dataframe from the column value, which is good, but what I actually need is for the id to extend along with it, like so:
id review_id review_text
1 8987 wonderful
1 8988 good
2 9098 not good
2 9895 terrible
Notice that the id extended along with the reviews. How do I implement a solution to this?
As a reference, here is a small sample of the dataset: https://aimedu-my.sharepoint.com/:x:/g/personal/matthewromero_msds2021_aim_edu/EfhdrrlYJy1KmGWhECf91goB7jpHuPFKyz8L3UTfyCSDiA?e=pYcap3
Based on the file that you provided and the result you describe, you can try this code:
import pandas as pd
import ast

# import data
df = pd.read_excel('./restaurants_reviews_sample.xlsx', usecols=[1, 2])

# change the column to a list of dictionaries
df.user_reviews = df.user_reviews.apply(lambda x: list(ast.literal_eval(x)))

# explode the reviews so each list element becomes its own row
df = df.explode('user_reviews')

# resetting the index
df.reset_index(inplace=True, drop=True)

# unnesting the review dictionary
df.user_reviews = df.user_reviews.apply(lambda x: x['review'])

# creating the new columns (only the ones we need)
df = df.assign(id='', review_text='')

# populate the columns from the dictionary in user_reviews
# (.at avoids the chained assignment df[c][i], which may fail to write through)
cols = list(df.columns[2:4])
for i in range(len(df)):
    for c in cols:
        df.at[i, c] = df.user_reviews[i][c]

# cleaning up the columns
df.drop(columns='user_reviews', inplace=True)
df.rename(columns={'id': 'review_id',
                   'index': 'id'}, inplace=True)
The new dataframe looks like this:
id review_id review_text
0 6301456 46743270
1 6301456 41974132 A yuppies place, meaning for the young urban poor the place is packed with the young crowd of the early 20’s and mid 20’s to early 30’s that can still take a loud music pumping on the background with open space where you can check out the girls for a possible get to know and possible pick up. Quite affordable for the combo bucket with pulutan for the limited budget crowd but is there to look for a hook up.
2 6301456 38482279 I celebrated my birthday here and it was awesome! My team enjoyed the place, foods and drinks. *tip: if you will be in a group, consider getting the package with cocktail tower and beers plus the platter. It is worth your penny! Kudos to Dylan and JP for the wonderful service that they have provided us and for making sure that my celebration will be a really good one, and it was! Thank you guys! See you again soon!
3 6301456 35971612 Sa lahat nang Central na napuntahan ko, dito ko mas bet! Unang una sa lahat, masarap yung foods and yung pagka gawa ng drinks nila. Hindi pa masyado pala away yung mga customers dito. 😂
4 6301456 35714330 Good place to chill and hang out. Not to mention the comfort room is clean. The staff are quite busy to attend us immediately but they are polite and courteous. Would definitely comeback! Cheers!
5 6301475 47379863 Underrated chocolate cake under 500 pesos! I definitely recommend this Cloud 9 cake!!! I’m not into chocolate but this one is good. This cake has a four layers, i loved the creamy white moose part. I ordered it via Grab Food and it was hassle free! 😀 The packaging was bad, its just white plastic container, Better handle it with care.
6 6301475 42413329 We loved the Cloud9 cake, its just right taste. We ordered it for our office celebration. However, we went back there to try other food. We get to try a chocolate cake that's too sweet, a cheese cake that's just right, and sansrival that's kind weird and i didnt expect that taste it's sweet and have a lot of nuts and.. i don't know i just didnt feel it. We also hand a lasagna, which is too saucey for is, it's like a soup of tomato. it's a bit disappointing, honestly. Other ordered from our next table looks good, and a lot of serving. They ordered rice meal, maybe you should try that .
7 6301475 42372938 Best cake i’ve eaten vs cakes from known brands such as Caramia and the like. Lots of white chocolate on top, not so sweet and similar to brazo de mercedes texture and, the merengue is the best!
8 6301475 41699036 This freaking piece of chicken costs 220Php. Chicken Cacciatore. Remember the name. DO NOT ORDER! This was my first time ordering something from your resto and I can tell you I AM NOT HAPPY!
9 6301475 40973213 Heard a lot about their famous chocolate cake. Bought a slice to try but found it quite sweet for my taste. Hope to try their other cakes though.
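For what it's worth, a more concise route under the same assumptions (each cell parses to a list of {'review': {...}} dicts) is to explode and then json_normalize in one pass:

import pandas as pd
import ast

df = pd.read_excel('./restaurants_reviews_sample.xlsx', usecols=[1, 2])
df.user_reviews = df.user_reviews.apply(ast.literal_eval)

# One row per review, then flatten the nested dicts into columns
exploded = df.explode('user_reviews').reset_index(drop=True)
reviews = pd.json_normalize(exploded['user_reviews'].tolist())  # yields 'review.review_id', 'review.review_text', ...
out = exploded.drop(columns='user_reviews').join(reviews[['review.review_id', 'review.review_text']])
out.columns = ['id', 'review_id', 'review_text']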
I've been going in circles for days now, and I've run out of steam. Doesn't help that I'm new to python / numpy / pandas etc.
I started with numpy which led me to pandas, because of a GIS function that delivers a numpy array of data. That is my starting point. I'm trying to get to an endpoint being a small enriched dataset, in an excel spreadsheet.
But it seems like going down a rabbit hole trying to extract that data and then manipulate it with the numpy toolset. The delivered data is one-dimensional, but each row contains 8 fields. A simple conversion to pandas and then to ndarray magically makes it all good, except that I lose the headers in the process, and it just snowballs from there.
I've had to re-evaluate my understanding based on some feedback on another post, and that's fine. But I'm just going in circles. Example after example seems to use predominantly numerical data, and I'm starting to get the feeling that's where its strength lies. Trying to use it for what I'd call a more non-mathematical/numerical purpose makes me feel like I'm barking up the wrong tree.
Any advice?
Addendum
The data I extract from the GIS system is names, dates, other textual data. I then have another csv file that I need to use as a lookup, so that I can enrich the source with more textual information which finally gets published to excel.
SAMPLE DATA - SOURCE
WorkCode Status WorkName StartDate EndDate siteType Supplier
0 AT-W34319 None Second building 2020-05-04 2020-05-31 Type A Acem 1
1 AT-W67713 None Left of the red office tower 2019-02-11 2020-08-28 Type B Quester Q
2 AT-W68713 None 12 main street 2019-05-23 2020-11-03 Class 1 Type B Dettlim Group
3 AT-W70105 None city central 2019-03-07 2021-08-06 Other Hans Int
4 AT-W73855 None top floor 2019-05-06 2020-10-28 Type a None
SAMPLE DATA - CSV
["Id", "Version","Utility/Principal","Principal Contractor Contact"]
XM-N33463,7.1,"A Contracting company", "555-12345"
XM-N33211,2.1,"Contractor #b", "555-12345"
XM-N33225,1.3,"That other contractor", "555-12345"
XM-N58755,1.0,"v Contracting", "555-12345"
XM-N58755,2.3,"dsContracting", "555-12345"
XM-222222,2.3,"dsContracting", "555-12345"
BM-O33343,2.1,"dsContracting", "555-12345"
def SMAN():
    ####################################################################################################################
    # Exporting the results of the analysis...
    ####################################################################################################################
    """
    Approach is as follows:
    1) Get the source data
    2) Get the CSV lookup data loaded into memory - it'll be faster
    3) Iterate through the source data, looking for matches in the CSV data
    4) Add an extra couple of columns onto the source data, and populate them with the (matching) lookup data.
    5) Export the now enhanced data to excel.
    """
    arcpy.env.workspace = workspace + filenameGDB
    input = "ApprovedActivityByLocalBoard"
    exportFile = arcpy.da.FeatureClassToNumPyArray(input, ['WorkCode', 'Status', 'WorkName', 'PSN2', 'StartDate', 'EndDate', 'siteType', 'Supplier'])
    # we have our data, but it's (9893,) instead of [9893 rows x 8 columns]
    pdExportFile = pandas.DataFrame(exportFile)
    LBW = pdExportFile.to_numpy()
    del exportFile
    del pdExportFile
    # Now we have [9893 rows x 8 columns] - but we've lost the headers
    col_list = ["Id", "Version", "Utility/Principal", "Principal Contractor Contact"]  # names must match the CSV header
    allPermits = pandas.read_csv("lookup.csv", usecols=col_list)
    # Now we have the CSV file loaded ... and only the important parts - should be fast.
    # Shape: (94523, 4)
    # will have to find a way to improve this...
    # CSV file has more than one row per WorkCode, because there are different versions (as different records)
    # Only want the last one.
    # each record must now be "enhanced" with the matching record from the CSV file.
    finalReport = []  # we are expecting this to be [9893 rows x 12 columns] at the end
    counter = -1
    for eachWorksite in LBW[:5]:  # let's just work with 5 records right now...
        counter += 1
        # eachWorksite = list(eachWorksite)  # eachWorksite is a tuple - so need to convert it
        #                                    # but if we change it to a list, we lose the headers!
        certID = LBW[counter][0]  # get the ID to use for lookup matching
        # Search the CSV data
        permitsFound = allPermits[allPermits['Id'] == certID]
        permitsFound = permitsFound.to_numpy()
        if numpy.shape(permitsFound)[0] > 1:
            print("Too many hits!")  # got to deal with that CSV Version field.
            exit()
        else:
            # now "enrich" the record/row by adding on the fields from the lookup
            # so a row goes from 8 fields to 12 fields
            newline = numpy.append(eachWorksite, permitsFound)
            # and this enhanced record/row must become the new normal
            # but I cannot change the original, so it must go into a new container
            finalReport = numpy.append(finalReport, newline, axis=0)
    # now I should have a new container of "enriched" data
    # which has gone from [9893 rows x 8 columns] to [9893 rows x 12 columns]
    # Some of the columns, of course, could be empty.

    # Now let's dump the results to an Excel file and make it accessible for everyone else.
    df = pandas.DataFrame(finalReport)
    filepath = 'finalreport.csv'
    df.to_csv(filepath, index=False)  # note: filepath without quotes, or it writes to a file literally named "filepath"
    # Somewhere I was getting Error("Cannot convert {0!r} to Excel".format(value))
    # Now I get
    filepath = 'finalReport.xlsx'
    df.to_excel(filepath, index=False)
I have eventually answered my own question, and this is how:
Yes, for my situation, pandas worked just fine, even beautifully, for manipulating non-numerical data. I just had to learn some basics.
The biggest learning was to understand the pandas DataFrame as an object that has to be manipulated remotely by various functions/tools. Just because I "print" the dataframe doesn't mean it's just text. (Thanks juanpa.arrivillaga for pointing out my erroneous assumptions in Why can I not reproduce a nd array manually?)
I also had to wrap my mind around the concept of indexes and columns, how they can be altered/manipulated, and then how to use them to maximum effect.
Once those fundamentals had been sorted, the rest followed naturally, and my code reduced to a couple of nice elegant functions.
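In case it helps anyone following the same path: the whole lookup-and-enrich loop above collapses into a single merge once the data lives in DataFrames. A minimal sketch, assuming exportFile is the structured array from FeatureClassToNumPyArray and the column names from the sample data:

import pandas as pd

source = pd.DataFrame(exportFile)  # a structured NumPy array converts with its field names intact
lookup = pd.read_csv('lookup.csv',
                     usecols=['Id', 'Version', 'Utility/Principal', 'Principal Contractor Contact'])

# Keep only the latest Version per Id, since the CSV stores one record per version
lookup = lookup.sort_values('Version').drop_duplicates('Id', keep='last')

# Left join: every source row survives; unmatched lookup columns simply come out as NaN
report = source.merge(lookup, left_on='WorkCode', right_on='Id', how='left')
report.to_excel('finalReport.xlsx', index=False)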
Cheers
Background
I have the following in a .xlsx file when I run a report at work.
A1 - First Name
B1 - Last Name
C1 - Date Attended
Each row contains the data for each person that attended one of our events. I am building a program in Python that will take a master.xlsx file and compare it with another .xlsx, giving me the following output:
A .txt file with anyone who hasn't attended an event in the past 2 weeks.
A .txt file with anyone who hasn't attended an event in the past 4 weeks.
A .txt file with anyone who hasn't attended an event in 4+ weeks.
A new master.xlsx file with First_Name, Last_Name, Last_Date_Attended
The second .xlsx report is run weekly but actually contains a month's worth of attendance data. That means that if Joe Blow attended 6 times in a month, Joe Blow will appear in 6 rows of the .xlsx file, each with a unique date. So I am going to iterate over the data, compare the dates, and keep only the most recent one.
The Question
I have actually already done the above, and my first inclination was to turn it into a dictionary of dictionaries, where the last name is the key, with values for 'first', 'date', and 'total attended'. Total attended is calculated as part of the for loop.
But a dictionary in a dictionary just doesn't feel Pythonic. I feel like I'm hacking my way around a simpler solution, especially once I begin writing the output files. Accessing values of a dict in a dict doesn't feel right.
Thoughts or suggestions on a better way?
Here's a sample of the code I wrote last night:
data = [This is the data from the .xlsx as a list of lists]

final_data = dict()
dict_errors = 0
for i in data:
    if len(i) == 3:  # guard reconstructed from the flattened original: a well-formed row is [first, last, date]
        if i[1] in final_data:
            # Seen this last name before: bump the count, keep the most recent date
            final_data[i[1]]['total'] = final_data[i[1]]['total'] + 1
            if final_data[i[1]]['date'] < i[2]:
                final_data[i[1]]['date'] = i[2]
        else:
            final_data[i[1]] = {
                'first': i[0],
                'date': i[2],
                'total': 1
            }
    else:
        dict_errors += 1
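For comparison, if the list of lists goes into pandas, the same bookkeeping becomes a single groupby; a minimal sketch, assuming well-formed rows of [first, last, date] in data:

import pandas as pd

df = pd.DataFrame(data, columns=['first', 'last', 'date'])
summary = (df.groupby('last')
             .agg(first=('first', 'first'),   # first name
                  date=('date', 'max'),       # most recent attendance
                  total=('date', 'size'))     # number of attendances
             .reset_index())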