Split per attribute - python

I am trying to read a big CSV and split it into smaller CSV files based on the unique values in the column team.
At first I created a new dataframe for each team by hand, which generated one new txt file for each unique value in the team column.
Code:
import pandas as pd
df = pd.read_csv('combined.csv')
df = df[df.team == 'RED']
df.to_csv('RED.csv')
However I want to start from a single dataframe, read all unique 'teams', and create a .txt file for each team, with headers.
Is it possible?

pandas.DataFrame.groupby, when used without an aggregation, can be iterated over: each iteration yields the group label and the sub-dataframe associated with that unique value in the groupby column.
The following code creates a file for the data associated with each unique value in the column used to group by.
Use an f-string to create a unique filename for each group.
import pandas as pd
# create the dataframe
df = pd.read_csv('combined.csv')
# groupby the desired column and iterate through the groupby object
for group, dataframe in df.groupby('team'):
    # save the dataframe for each group to a tab-separated txt file
    dataframe.to_csv(f'{group}.txt', sep='\t', index=False)
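With the sample above this writes one tab-separated file per team, e.g. RED.txt, and each file keeps its header row: index=False only drops the row index, while the column headers are written by default.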

Related

Python Mapping and Anonymizing Script not doing what it should be

I have a small script that aims to anonymize an Excel file using another Excel file. More specifically, there is a mastersheet that contains the columns "sensitive" and "Anonymized_Value". In another Excel file called "Raw" there is also a column named "sensitive", with the same values as "sensitive" in the mastersheet, so I am trying to replace "sensitive" in Raw with "Anonymized_Value" from the mastersheet. (Note: all values in "sensitive" are unique, each with its own unique "Anonymized_Value".)
import pandas as pd
# Load the master anonymizer workbook into a pandas dataframe
master_df = pd.read_excel("Master_Anonymizer.xlsx")
# Create a dictionary mapping "sensitive" to "Anonymized_Value"
sensitive_dict = dict(zip(master_df["sensitive"], master_df["Anonymized_Value"]))
# Load the raw dataset into a pandas dataframe
raw_df = pd.read_excel("Raw.xlsx")
# Find the first column whose name contains "sensitive" (case-insensitive)
sensitive_column = [col for col in raw_df.columns if "sensitive" in col.lower()][0]
# Replace the values in that column with the mapped "Anonymized_Value"
raw_df[sensitive_column] = raw_df[sensitive_column].map(sensitive_dict)
# Save the anonymized dataframe to a new excel file
raw_df.to_excel("Anonymized.xlsx", index=False)
When I run it, all the formatting of "Anonymized.xlsx" gets messed up. More specifically, the column names become bolded, and some columns (whose names do not contain "sensitive") are altered or blanked out.
Any help?
Thank you
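No answer is attached to this question here, but one relevant fact is that pandas.DataFrame.to_excel always writes a brand-new workbook with default styling; it never copies the formatting of Raw.xlsx. A hedged sketch of one way to keep the original formatting, assuming the data sits on the first sheet with headers in row 1, is to write only the mapped values back into the existing workbook with openpyxl:
import pandas as pd
from openpyxl import load_workbook

# Build the same mapping as in the question
master_df = pd.read_excel("Master_Anonymizer.xlsx")
sensitive_dict = dict(zip(master_df["sensitive"], master_df["Anonymized_Value"]))

# Open the raw workbook with openpyxl so the existing cell styles are kept
wb = load_workbook("Raw.xlsx")
ws = wb.active  # assumption: the data lives on the first sheet

# Locate the header cell whose name contains "sensitive" (headers assumed in row 1)
col_idx = next(
    i for i, cell in enumerate(ws[1], start=1)
    if cell.value and "sensitive" in str(cell.value).lower()
)

# Overwrite only the values in that column; every other cell is left untouched
for row in range(2, ws.max_row + 1):
    cell = ws.cell(row=row, column=col_idx)
    cell.value = sensitive_dict.get(cell.value, cell.value)

wb.save("Anonymized.xlsx")
Because the workbook is modified in place rather than rewritten from a dataframe, the bolding and blanked-out columns described above should not appear.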

Adding a new column to a dataframe based on the values of another dataframe

I have two csv files, and I am using pandas to read the data.
The train.csv contains values, with headers id, sentiment
87,Positive
10,Positive
7,Neutral
The text.csv contains values, with headers id, text
7,hello, I think the price if high...
87, you can call me tomorow...
....
I would like to insert the text from text.csv into train.csv so the result would be:
87,Positive, you can call me tomorow...
Can anyone help with pandas?
import pandas as pd
train= pd.read_csv("train.csv")
text= pd.read_csv("text.csv")
# this does not work
combined= pd.merge(train, text, on=['id'])
Note: some ids may not be present in both files, so I need to set null if the id does not exist.
Set the indices on the two dataframes, then add the columns:
train.set_index('id').sentiment + text.set_index('id').text
One easy way can be:
pd.merge(train, text, on='id', how='outer')
As per the pandas docs, if you use how='outer' it takes the union of keys from both frames, so an id missing from one file ends up with null in the added column.
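A quick sketch with the sample rows from the question shows this (an id that appears in only one file gets NaN, which is what ends up as the null you need):
import pandas as pd

train = pd.DataFrame({"id": [87, 10, 7],
                      "sentiment": ["Positive", "Positive", "Neutral"]})
text = pd.DataFrame({"id": [7, 87],
                     "text": ["hello, I think the price is high...",
                              "you can call me tomorrow..."]})

combined = pd.merge(train, text, on="id", how="outer")
# every id from either frame appears exactly once;
# id 10 has no match in text.csv, so its text column is NaN
print(combined)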

Python how to search value in one csv based on another csv - pandas?

I'm trying to create a program that needs to search a csv file for matching values in another csv file.
Here is what I have so far:
import pandas as pd
import numpy as np
listings = pd.read_csv("data/listings.csv")
inventoryValue = pd.read_csv("data/inventoryValue.csv")
#get rid of rows with empty values in column 'Item Number'
listings['Item Number'].replace('', np.nan, inplace=True)
listings.dropna(subset=['Item Number'], inplace=True)
#get rid of rows with empty values in column 'AvgCost'
inventoryValue['Avg Cost'].replace('', np.nan, inplace=True)
inventoryValue.dropna(subset=['Avg Cost'], inplace=True)
# here: how can I search for all rows in inventoryValue['Item Number'] based on listings['Item Number']?
So basically I need to use the Item Number column in listings to find the rows with a matching Item Number in inventoryValue; from there I can get the columns I need from inventoryValue and save the file.
Any help is much appreciated!
I believe what you want can be achieved using isin. This method filters a dataframe by selecting the rows that have a particular value in a particular column.
In your case, you can create a list that contains all the unique values of listings['Item Number'], then check which elements of inventoryValue['Item Number'] are present in it, and get back a reduced dataframe:
my_list = listings['Item Number'].unique().tolist()
new_inventoryValue = inventoryValue[inventoryValue['Item Number'].isin(my_list)]
This returns a smaller dataframe (row-wise) with all the columns, but its 'Item Number' column will contain only elements that are in my_list.
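From there, the columns you need can be selected and the result saved; the column names below are only placeholders taken from the question:
# keep only the columns of interest and write them out (names illustrative)
new_inventoryValue[['Item Number', 'Avg Cost']].to_csv('matched_inventory.csv', index=False)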

How to make multiple json processing faster using pySpark?

I have a list of json files in Databricks, and what I am trying to do is read each json, extract the values needed, and then append them as a row to an initially empty pandas dataframe. Each json file corresponds to one row of the final dataframe. The initial json file list is 50k entries long. What I have built so far is the function below, which does the job, but it takes so much time that I have to split the file list into bins of 5k and run each bin separately, at about 30 minutes each. I am limited to a 3-node cluster in Databricks.
Any chance you could improve the efficiency of my function? Thanks in advance.
### Create a big dataframe including all json files ###
def jsons_to_pdf(all_paths):
    # Create an empty pandas dataframe (it is defined only with column names)
    pdf = create_initial_pdf(samplefile)
    # Append each row into the above dataframe
    for path in all_paths:
        # Create a spark dataframe
        sdf = sqlContext.read.json(path)
        # Create two extracted lists of values
        init_values = sdf.select("id","logTimestamp","otherTimestamp").rdd.flatMap(lambda x: x).collect()
        id_values = sdf.select(sdf["dataPoints"]["value"]).rdd.flatMap(lambda x: x).collect()[0]
        # Append the concatenated list as a row of the initial dataframe
        pdf.loc[len(pdf)] = init_values + id_values
    return pdf
Each json file contains the top-level fields selected above plus a dataPoints array of id/value pairs.
What I want to achieve is to have each dataPoints['id'] become a new column, with the corresponding dataPoints['value'] as its value, so that every json file ends up as one row of the final dataframe.
According to your example, what you want to perform is a pivot and then transform your data into a pandas dataframe.
The steps are :
Collect all your jsons into 1 big dataframe,
pivot your data,
transform them into a pandas dataframe
Try something like this:
from functools import reduce
from pyspark.sql import functions as F  # needed for F.explode

def jsons_to_pdf(all_paths):
    # Create a big dataframe from all the jsons
    sdf = reduce(
        lambda a, b: a.union(b),
        [
            sqlContext.read.json(path)
            for path in all_paths
        ]
    )
    # select and pivot your data
    pivot_df = sdf.select(
        "imoNo",
        "logTimestamp",
        "payloadTimestamp",
        F.explode("datapoints").alias("datapoint")
    ).groupBy(
        "imoNo",
        "logTimestamp",
        "payloadTimestamp",
    ).pivot(
        "datapoint.id"
    ).sum("datapoint.value")
    # convert to a pandas dataframe
    pdf = pivot_df.toPandas()
    return pdf
According to your comment, you can replace the list of files all_paths with a generic path and change the way you create sdf:
all_paths = 'abc/*/*/*'  # 3x *, one for year, one for month, one for day

def jsons_to_pdf(all_paths):
    # Create a big dataframe from all the jsons in a single read
    sdf = sqlContext.read.json(all_paths)
This should noticeably improve performance: Spark lists and reads all the files in one job instead of launching a separate read (and collect) for each of the 50k files.

Issue related to pandas to_csv function applied to grouped DataFrame

I have a pandas DataFrame df with two variables in it: a string variable str1 and a floating point numerical variable fp1. I group that DataFrame like so:
dfg = df.groupby(pandas.qcut(df['fp1'],4,labels=['g1','g2','g3','g4']))
I want to write out the grouped results to a csv file. When I try:
dfg.to_csv('dfg.csv')
the csv file contains observations only for group g4. How can I get the to_csv method to write out the whole grouped DataFrame dfg?
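No answer is included here, but the behaviour is consistent with how a GroupBy object dispatches methods it does not define itself: to_csv is applied to each group in turn with the same filename, so every group overwrites the previous one and only the last group (g4) survives. A rough sketch of two workarounds, assuming df and dfg exactly as defined in the question and reusing the groupby-iteration pattern from the first question on this page:
import pandas

dfg = df.groupby(pandas.qcut(df['fp1'], 4, labels=['g1', 'g2', 'g3', 'g4']))

# Option 1: one csv per quartile group
for label, group_df in dfg:
    group_df.to_csv(f'dfg_{label}.csv', index=False)

# Option 2: a single csv, with the group label preserved as the outer index level
# (pandas.concat of a dict of dataframes uses the dict keys as that outer level)
pandas.concat({label: group_df for label, group_df in dfg}).to_csv('dfg.csv')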
