Working with and grouping data from CSV files using pandas - python

In relation to the topic Convert data from CSV to list of dict, I would like to ask a few additional questions regarding the processing of data contained in the CSV file.
What is the best way to rename values in "school_subject" with the condition: where term == 2 and school_subject == "foreign language"? I would like to rename "foreign language" to "oth_lang", and I am looking for the most performant way to do this. I could make a loop and change the values, but that is the easiest way, not the best.
Partial answer to question 1:
df.loc[(df['school_subject'].str.contains('foreign language')) & (df['term'] == '2'), 'school_subject'] = 'oth_lang'
Is it possible to put another group of conditions in the same loc, for example (df['school_subject'].str.contains('Informatics'))? In the current version I need two lines of similar code with the next set of conditions.
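One possibility (a minimal sketch, with sample data made up from the question's description): both subject conditions can share a single loc through a regex alternation in str.contains. If each subject needed a different replacement value, numpy.select with parallel lists of conditions and choices would avoid stacking several loc lines.

import pandas as pd

# assumed sample data mirroring the question's columns
df = pd.DataFrame({'school_subject': ['foreign language', 'Informatics', 'math'],
                   'term': ['2', '2', '1']})

# one mask covering both subjects via a regex alternation
mask = (df['term'] == '2') & df['school_subject'].str.contains('foreign language|Informatics')
df.loc[mask, 'school_subject'] = 'oth_lang'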
#jezrael helped me create the correct query for grouping data. How can we:
exclude data using the code from the linked question? I need to separate the data by term.
join data - e.g. treat the students from terms 1 and 2 as one year? (A rough sketch follows below.)
Thank you for help and potential example of code.
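For the last two points, a rough sketch (column names assumed from the question; terms are compared as strings, matching the loc line above):

# excluding data by term is plain boolean filtering
term1 = df[df['term'] == '1']

# joining terms 1 and 2 into one year: map both to a common label, then group
df['year'] = df['term'].map({'1': 'year_1', '2': 'year_1'})
counts_per_year = df.groupby(['year', 'school_subject']).size()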

Looping by certain columns and classification by keyword?

I've been working on data classification as part of a research project, but since there are thousands of different values, I thought it best to use Python to simplify the process rather than going through each record and classifying it manually.
Basically, I have a dataframe in which one column is entitled "description" and another is entitled "codes". Each row in the "description" column contains a survey response about activities. The descriptions are all different but may contain certain keywords. I have a list of some 40 codes with which to classify each row based on its text. I was thinking of manually creating some columns in the csv file and, in each column, typing a keyword corresponding to each of the codes. Then a loop (or function with a loop) is applied to the dataframe that goes through each row, and if a substring corresponding to any of the keywords is found, it updates the "codes" column with the code corresponding to that keyword.
My Dilemma
For example:
Suppose the list of codes is "Dance", "Nap", "Run", and "Fight", stored in a separate dataframe column. This dataframe, together with the manually entered keyword columns, is shown below (there can be more than two keyword columns; I just used two for illustration purposes).
This dataframe is named "classes".
category  Keyword1  Keyword2
Dance     dance     danc
Nap       sleep     slept
Run       run       quick
Fight     kick      unch
The other dataframe is as follows, with the "codes" column initially blank.
This dataframe is named "data".

description       codes
Iwasdancingthen
She Slept
He landed a kick
We are family
The function or loop will search through the "description" column above and check whether the keywords appear in a given row. If they do, the corresponding code is applied (as shown in the resulting dataframe below). If not, the row in the "codes" column is left blank. The loop should run as many times as there are keyword columns; in this case it will run twice, since there are two keyword columns.
description       codes
Iwasdancingthen   Dance
She Slept         Nap
He landed a kick  Fight
We are family
FYI: the keywords don't actually have to be complete words; I'd like to use partial words too, as you can see above.
Also, the loop or function I want to make should be case-insensitive and should match keywords inside concatenated strings.
I hope you understand what I'm trying to do.
What I tried:
At first, I tried using a dictionary and manipulating it somehow. I used the advice here:
search keywords in dataframe cell
However, this didn't work too well, as many NaN values popped up and it became too complicated, so I tried a different route using lists. The code I used was based on another user's advice:
How to conditionally update DataFrame column in Pandas
Here's what I did:
# Create lists from the classes dataframe
Keyword1list = classes["Keyword1"].values.tolist()
Category = classes["category"].values.tolist()
I then used the following loop for classification
for i in range(len(Keyword1list)):
    data.loc[data["description"] == Keyword1list[i], "codes"] = Category[i]
However, the resulting output still gives me NaN for all rows. Also, I don't know how to loop over every keyword column (in this case, the two columns "Keyword1" and "Keyword2").
I'd really appreciate it if anyone could help me with a function or loop that works. Thanks in advance!
Edit: It was pointed out to me that some descriptions might contain multiple keywords. I forgot to mention that the codes in the "classes" dataframe are ordered by rank, so the ones that appear first in the dataframe should take priority; for example, if keywords for both "Dance" and "Nap" are in a description, the code listed higher in the "classes" dataframe (i.e. Dance) should be selected and entered into the "codes" column. I hope there's a way to do that.
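For what it's worth, one possible sketch (not from the original thread): the == in the attempted loop tests whole-cell equality, whereas substring matching needs str.contains. Iterating over the categories in rank order and only filling rows whose code is still blank gives the priority behaviour described in the edit:

import pandas as pd

# the example frames from the question, rebuilt for illustration
classes = pd.DataFrame({"category": ["Dance", "Nap", "Run", "Fight"],
                        "Keyword1": ["dance", "sleep", "run", "kick"],
                        "Keyword2": ["danc", "slept", "quick", "unch"]})
data = pd.DataFrame({"description": ["Iwasdancingthen", "She Slept",
                                     "He landed a kick", "We are family"],
                     "codes": ["", "", "", ""]})

keyword_cols = [c for c in classes.columns if c.startswith("Keyword")]
# walk the categories top to bottom so higher-ranked codes win
for _, row in classes.iterrows():
    for col in keyword_cols:
        mask = (data["codes"] == "") & data["description"].str.contains(
            row[col], case=False, regex=False)
        data.loc[mask, "codes"] = row["category"]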

Is there a pandas function to merge 2 dfs so that repeating items in the second df are added as columns to the first df?

I have a hard time formulating this problem in abstract terms, so I will mostly try to explain it with examples.
I have 2 pandas dataframes (I get them from a SQLite DB).
First DF:
Second DF:
So the thing is: There are several images per "capture". I would like to add the images to the capture df as columns, so that each capture has 9 image columns, each with a path. There are always 9 images per capture.
I solved it in pandas with what I know in the following way:
cam_idxs = sorted(list(range(9)) * 2)
for cam_idx in cam_idxs:
    sub_df = images.loc[images["camera_id"] == cam_idx]
    captures = captures.merge(sub_df[["image", "capture_id"]], left_on="id",
                              right_on="capture_id")
I imagine, though, that there must be a better way. People probably stumble into this problem often when getting data from a SQL database.
Since I am getting the data into pandas from a SQL database, I am also open to SQL commands that produce this result. I'm also grateful to anyone telling me what this kind of operation is called; I did not find a good way to google for it, which is why I am asking here. Excuse me if this question was asked somewhere; I did not find anything with my search terms.
So the question at the end is: is there a better way to do this, especially a more efficient way?
What you are looking for is a pivot table.
You just need to create a column containing the index of each image within its capture_id, which you will use as the columns in the pivot table.
For example, this could be:
images['column_pivot'] = [x for x in range(1,10)]*int(images.shape[0]/9)
In your case 'column_pivot' would be [1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9...7,8,9] (i.e. cycling from 1 to 9).
Then you pivot (aggfunc='first' is needed here, because the default aggregation, mean, would fail on string paths):
pd.pivot_table(images, columns='column_pivot', index='capture_id', values='image', aggfunc='first')
This will give the expected result.
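For completeness, a self-contained sketch with made-up paths (the column names come from the question; the "image_*" output names are my own choice):

import pandas as pd

images = pd.DataFrame({
    "capture_id": [1] * 9 + [2] * 9,
    "camera_id": list(range(9)) * 2,
    "image": [f"/data/cap{c}_cam{i}.png" for c in (1, 2) for i in range(9)],
})
images["column_pivot"] = [x for x in range(1, 10)] * int(images.shape[0] / 9)

wide = pd.pivot_table(images, columns="column_pivot", index="capture_id",
                      values="image", aggfunc="first")
wide.columns = [f"image_{c}" for c in wide.columns]  # flatten column labels

# then join the 9 image columns back onto the captures frame, e.g.:
# captures = captures.merge(wide, left_on="id", right_index=True)

This kind of long-to-wide reshaping is usually called pivoting (a pivot or crosstab in SQL), which should help with searching.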

How to calculate the sum of conditional cells in excel, populate another column with results

EDIT: Using advanced search in Excel (under the Data tab) I have been able to create a list of unique company names, and am now able to SUMIF based on the cell containing the company's name!
Disclaimer: any Python solutions would be greatly appreciated as well, pandas specifically!
I have 60,000 rows of data, containing information about grants awarded to companies.
I am planning on creating a python dictionary to store each unique company name, with their total grant $ given (agreemen_2), and location coordinates. Then, I want to display this using Dash (Plotly) on a live MapBox map of Canada.
First things first: how do I calculate and store the total value that was awarded to each company?
I have seen SUMIF in other solutions, but am unsure how to output this to a new column, if that makes sense.
One potential solution I thought of was to create a new column of unique company names and, next to it, SUMIF all the appropriate cells in column D.
PYTHON STUFF SO FAR
So with the code below, I take a much messier-looking spreadsheet, drop duplicates, sort based on company name, and create a new pandas dataframe with the relevant data columns:
corp_df is the cleaned-up new dataframe that I want to work with, and recipien_4 is the company's unique ID number; as you can see, it repeats with each grant awarded. Folia Biotech in the screenshot shows a duplicate grant, as proven by a column I did not include in the screenshot. There are quite a few duplicates, as seen in the screenshot.
import pandas as pd

in_file = '2019-20 Grants and Contributions.csv'

# create dataframe
df = pd.read_csv(in_file)

# sort by company name (recipien_2)
df.sort_values("recipien_2", inplace=True)

# remove duplicate agreements
df.drop_duplicates(subset='agreemen_1', keep='first', inplace=True)

# full name, id, grant $, longitude, latitude
corp_df = df[['recipien_2', 'recipien_4', 'agreemen_2', 'longitude', 'latitude']]

# create an empty dict with one entry per corporation name, all values 0
corp_dict = {}
for name in corp_df['recipien_2']:
    if name not in corp_dict:
        corp_dict[name] = 0
Any tips or tricks would be greatly appreciated. .itertuples() didn't seem like a good solution, as I am unsure how to filter and compare data, or whether datatypes are preserved. But feel free to prove me wrong, haha.
I thought perhaps there was a better way to tackle this problem straight in Excel vs. iterating through rows of a pandas dataframe. This is a pretty open question, so thank you for any help or direction you think is best!
I can see that you are using pandas to read the csv file, so you can use the groupby method.
You can create a new dataframe with the totals per company name like this:
dfnew = df.groupby('recipien_2')[['agreemen_2']].sum()
Then dfnew has the values.
Pandas groupby documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
The use of groupby followed by a sum may be the best for you:
corp_df = df.groupby(by=['recipien_2', 'longitude', 'latitude'])['agreemen_2'].sum()
# if you want to turn the index back into columns, you can add this afterwards:
corp_df = corp_df.reset_index()
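Since the end goal is a dictionary of company name to total grant $ for the Dash/MapBox map, a possible follow-up (a sketch, assuming the column names above):

corp_totals = df.groupby('recipien_2')['agreemen_2'].sum()
corp_dict = corp_totals.to_dict()  # {company name: total grant $}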

Pandas dataframe- create new list column consisting of aggregation of strings from grouped column

I've been struggling with this one a bit and am feeling stuck.
I have a dataframe named merged_frames, consisting of data like this (it is a single frame, created by concatenating a handful of frames with the same shape):
   fqdn       source
0  site1.org  public_source_a
1  site2.org  public_source_a
2  site3.org  public_source_a
3  site1.org  public_source_b
4  site4.org  public_source_b
5  site1.org  public_source_b
6  site4.org  public_source_d
7  site1.org  public_source_c
...
What I am trying to do is create a new column in this frame containing a list (ideally a Python list as opposed to a comma-delimited string) of the sources when grouping by the fqdn value. For example, the data produced for the fqdn value site1.org should look like this, based on the example data (this is just a subset of what I would expect; there should also be rows for the other fqdn values):
fqdn       source_list                                           source
site1.org  [public_source_a, public_source_b, public_source_c]  public_source_a
site1.org  [public_source_a, public_source_b, public_source_c]  public_source_b
site1.org  [public_source_a, public_source_b, public_source_c]  public_source_b
site1.org  [public_source_a, public_source_b, public_source_c]  public_source_c
Once I have the data in this form, I will simply drop the source column and then use drop_duplicates(keep='first') to get rid of all but one row per fqdn.
I dug up some old code that I used to do something similar about two years ago, and it is not working as I expected. It's been quite a while since I've done something like this with pandas. What I had was along the lines of:
merged_frame['source_list'] = merged_frame.groupby(
    'fqdn', as_index=False)[['source']].aggregate(
        lambda x: list(x))['source']
This is behaving very strangely. While it does create source_list as a list/array, the data in the column is not correct. Additionally, quite a few fqdn values end up with a null/NaN value for source_list.
I have a feeling I need to approach this completely differently. A little help would be appreciated; I'm completely blocked now and am not making any progress, despite having what I thought were very relevant example blocks of code I used on a similar dataset.
EDIT:
I have made a little progress by just starting with the fundamentals and have the following, though this joins the strings together rather than making them a list:
merged_frame['source_list'] = merged_frame.groupby('fqdn').source.transform(','.join)
I'm pretty sure that with a simple apply here I can split them back into a list. But what would be the correct way to do this in one shot, so that I don't need the unnecessary join followed by apply(split(','))?
Create the data frame from the example above:
import pandas as pd

df = pd.DataFrame({
    'fqdn': ['site1.org', 'site2.org', 'site3.org', 'site1.org',
             'site4.org', 'site1.org', 'site4.org', 'site1.org'],
    'source': ['public_source_a', 'public_source_a', 'public_source_a',
               'public_source_b', 'public_source_b', 'public_source_b',
               'public_source_d', 'public_source_c']})
Use groupby with unique, then apply(list):
df_grouped=df.groupby('fqdn')['source'].unique().apply(list).reset_index()
Merge with original df and rename columns
result=pd.merge(df,df_grouped,on='fqdn',how='left')
result.rename(columns={'source_x':'source','source_y':'source_list'},inplace=True)
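To answer the "one shot" part of the question, a possible alternative (a sketch, using the same column names) skips the merge by mapping the grouped lists back onto the original frame:

source_lists = df.groupby('fqdn')['source'].unique().apply(list)
df['source_list'] = df['fqdn'].map(source_lists)

Since map looks up each fqdn in the grouped Series, every row gets the full list for its fqdn without a join or a string round-trip.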

Advance processing multiple Data Frames in python

I have a few (15) data frames. They contain values based on one map, but in fragmentary form.
The list of samples looks like: A1 - 3k records, A2 - 6k records, B1 - 12k records, B2 - 1k records, B3 - 3k records, C1... etc.
All files have the same format, which looks like this:
name     sample   position  position_ID
String1  String1  num1      num1
String2  String2  num2      num2
...
All files come from a variety of biological microarrays. Different companies have different matrices, hence the scatter in file sizes. But each of them is based on one common, whole database; just some of the data from the main database is selected. Therefore, individual records can be repeated between files. I want to check whether they are consistent.
What do I want to achieve in this task?
I want to check that all records with the same name have the same position and position_ID values in all files.
If a record with the same name differs in values in any file, it must be written to error.csv.
If it is the same everywhere - result.csv.
To be honest, I do not know how to approach this, so I am hoping someone here can give me good advice. I want to do it in Python.
I have two ideas.
Load all files into pandas as one data frame and try to write a function filtering the whole DF record by record (a for loop with if statements?).
Open all files separately with plain Python file reading, adding unique rows to a new list; whenever the read function encounters the same record name again, it checks it against the previous one. If all the remaining values are the same, it passes without writing; if not, the record is written to error.csv.
I am afraid, however, that these may not be the most optimal methods, hence I am asking you for advice and direction toward something better. I have read about numpy; I have not studied it yet, but maybe it is worth using in the context of this task? Maybe a function that does this already exists and I do not know about it?
Can someone suggest a more sensible (maybe easier) solution?
I think I have a rough idea of where you are going. This is how I would approach it:
import pandas as pd

df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
df1["filename"] = "file1.csv"
df2["filename"] = "file2.csv"

df_total = pd.concat([df1, df2], axis=0)  # stacks them vertically

# drop rows that agree on every data column (filename excluded), so records
# that are identical across files collapse to a single row
df_total_no_dupes = df_total.drop_duplicates(
    subset=["name", "sample", "position", "position_ID"])

# names that still occur more than once carry conflicting values somewhere
name_counts = df_total_no_dupes.groupby("name").size().reset_index(name="counts")
names_which_appear_more_than_once = name_counts[name_counts["counts"] > 1]["name"].unique()
filter_condition = df_total_no_dupes["name"].isin(names_which_appear_more_than_once)

# your dataframe with at least two rows per name holding different values
print(df_total_no_dupes[filter_condition].sort_values("name"))
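A possible follow-up (a sketch, assuming the column names from the question) to split the result into the two requested files:

conflicting = df_total_no_dupes[filter_condition]
consistent = df_total_no_dupes[~filter_condition]
conflicting.to_csv("error.csv", index=False)
consistent.to_csv("result.csv", index=False)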
