Python/Pandas CSV Parsing For Form Responses

I used the JotForm Configurable List widget to collect data, but I'm having trouble parsing or reading the data, as the number of records is greater than 2K.
The configurable field name is Person Details, and the list takes these options as input:
Name, Gender, Date of Birth, Govt. ID, Covid Test, Covid Result, Type of Follow Up, Qualification, Medical History, Disabilities, Employment Status, Individual Requirement
A snapshot of the Excel file: Configurable List Submissions
I want the Excel or CSV sheet, which has the data in one column as in the snapshot, to be exported into separate columns, with the list options mentioned above as the heading of each column.
I'm very new to Python, pandas, and data parsing, and this is for a very important social-benefit project to help people during the COVID crisis, so any help would be gladly appreciated :)

Having the labels repeated in each row isn't something the standard pandas tools like read_csv handle natively. I would iterate through the rows as text strings and build the dataframe one row at a time: turn each line into the form pd.Series({"Column1": "data", "Column2": "data", ...}), then build a dataframe out of a list of those objects.
import pandas as pd

# Sample data
data = ["Column1: Data1, Column2: Data2, Column3: Data3", "Column1: Data4, Column2: Data5, Column3: Data6"]
rows = []
# Iterate over the rows
for line in data:
    # Split along the ", " separators between fields
    split1 = line.split(', ')
    # Split each field into [label, value]
    split2 = [s.split(': ', 1) for s in split1]
    # split2 for the first row now looks like this:
    # [['Column1', 'Data1'], ['Column2', 'Data2'], ['Column3', 'Data3']]
    # Make a Series keyed by label
    row = pd.Series({item[0]: item[1] for item in split2})
    rows.append(row)
df = pd.DataFrame(rows)
Now df looks like this:
Column1 Column2 Column3
0 Data1 Data2 Data3
1 Data4 Data5 Data6
You can save it in this format with df.to_csv("filename.csv") and open it in tools like Excel.
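The same idea can be packaged as a small helper that strips stray whitespace around labels and values. This is only a sketch: it assumes each cell has the form "Label: value, Label: value", which may not match the real JotForm export, and the function name parse_labelled_line and the sample values are made up for illustration.

```python
import pandas as pd


def parse_labelled_line(line, field_sep=",", label_sep=":"):
    """Turn 'Label: value, Label: value' into a dict (hypothetical helper)."""
    record = {}
    for chunk in line.split(field_sep):
        if label_sep not in chunk:
            continue  # skip fragments that carry no label
        label, value = chunk.split(label_sep, 1)
        record[label.strip()] = value.strip()
    return record


# Made-up sample cells; the real export format is an assumption here.
cells = [
    "Name: Asha, Gender: F, Covid Result: Negative",
    "Name: Ravi, Gender: M, Covid Result: Positive",
]
df = pd.DataFrame([parse_labelled_line(c) for c in cells])
print(df)
```

A value that itself contains a comma would be cut short by this approach, so a different separator or a regular expression may be needed for free-text fields such as Medical History.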

Related

CSV - Split multiple-line cell into multiple cells

I’m currently doing some big data work. I have an issue in a .CSV where I need to split a multiple-line single-celled chunk of text, into individual cells. The below table shows the desired output. Currently, all of the 'ingredients' are in the same cell, with each ingredient on its own new line (Stack Overflow wouldn't allow me to create new lines in the same cell).
I need to write a script to split this single cell of ingredients into the below output, using each new line in the cell as a delimiter. The real use case I'm using this for is much more complex - over 200 'items', and anywhere between 50-150 'ingredients' per 'item'. I'm currently doing this manually in excel with a series of text to columns & transpose pastes, but it takes approximately 2-2.5 full work days to do.
Link to data
Code below
Item    Ingredients
Coffee  Coffee beans
        Milk
        Sugar
        Water
import pandas as pd
df = pd.read_csv(r'd:\Python\menu.csv', delimiter=';', header=None)
headers = ["Item", "Ingredients"]
df.columns = headers
df["Ingredients"]=df["Ingredients"].str.split("\n")
df = df.explode("Ingredients").reset_index(drop=True)
df.to_csv(r"D:\Python\output.csv")
Using your code and the linked data, change the delimiter to a comma, as below.
import pandas as pd
df = pd.read_csv('Inventory.csv', delimiter=',')
df["Software"]=df["Software"].str.split("\n")
df = df.explode("Software").reset_index(drop=True)
# Remove rows having empty string under Software column.
df = df[df['Software'].astype(bool)]
df = df.reset_index(drop=True)
df.to_csv("out_Inventory.csv")
print(df.to_string())
Output
Hostname Software
0 ServerName1 Windows Driver Package - Amazon Inc. (AWSNVMe) SCSIAdapter (08/27/2019 1.3.2.53) [version 08/27/2019 1.3.2.53]
1 ServerName1 Airlock Digital Client [version 4.7.1.0]
2 ServerName1 AppFabric 1.1 for Windows Server [version 1.1.2106.32]
3 ServerName1 BlueStripe Collector [version 8.0.3]
...
Here's how to do it with Python's standard csv module:
import csv

with open('input.csv', newline='') as f_in, open('output.csv', 'w', newline='') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    writer.writerow(next(reader))  # copy header
    for row in reader:
        item = row[0]
        ingredients = row[1].split('\n')
        # The first ingredient goes on the same row as the item
        writer.writerow([item, ingredients[0]])
        for ingredient in ingredients[1:]:
            writer.writerow([None, ingredient])  # None for a blank cell (under the item)
Given your small sample, I get this:
Item    Ingredients
Coffee  Coffee beans
        Milk
        Sugar
        Water

How to create a multiIndex dataframe from a streaming csv file

I'm streaming data to a csv file.
This is the request:
symbols = ["SPY", "IVV", "SDS", "SH", "SPXL", "SPXS", "SPXU", "SSO", "UPRO", "VOO"]
Each symbol has a list ranging from 0 to 8.
This is what the data looks like, in 3 columns:
-1583353249601,symbol,SH
-1583353249601,delayed,False
-1583353249601,asset-main-type,EQUITY
-1583353250614,symbol,SH
-1583353250614,last-price,24.7952
-1583353250614,bid-size,362
-1583353250614,symbol,VOO
-1583353250614,bid-price,284.79
-1583353250614,bid-size,3
-1583353250614,ask-size,1
-1583353250614,bid-id,N
My end goal is to reshape the data:
This is what I need to achieve.
The problems I encountered were not being able to group by timestamp and not being able to pivot.
1) I tried to create a dict so that it could later be passed to pandas, but I'm missing data in the process.
I need to find a way to group the data that has the same timestamp. It looks like lines with the same timestamp are omitted.
Code:
import csv
import pandas as pd

new_data_dict = {}
with open("stream_data.csv", 'r') as data_file:
    data = csv.DictReader(data_file, delimiter=",")
    for row in data:
        item = new_data_dict.get(row["timestamp"], dict())
        item[row["symbol"]] = row["value"]
        new_data_dict[row['timestamp']] = item
data = new_data_dict
data = pd.DataFrame.from_dict(data)
data.T
print(data.T)
2) This is another approach. I was able to group by timestamp by creating 2 different data sets, but I cannot split the value column into multiple columns to be merged later on matching indexes.
Code:
data = pd.read_csv("tasty_hola.csv",sep=',' )
data1 = data.groupby(['timestamp']).apply(lambda v: v['value'].unique())
data = data.groupby(['timestamp']).apply(lambda v: v['symbol'].unique())
data1 = pd.DataFrame({'timestamp':data1.index, 'value':data1.values})
At this point I don't know whether the logic I'm trying to apply is the correct one. I'm very lost and can't see the light at the end of the tunnel.
Thank you very much
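One reason the first attempt loses data is visible in the code itself: the dictionary is keyed on the timestamp alone, so when two tickers report the same field at the same timestamp (bid-size for SH and VOO above), the later value overwrites the earlier one. The sketch below keys on (timestamp, ticker) instead. It rests on several assumptions: the CSV header is timestamp,symbol,value (taken from the code above), a row whose second field is literally "symbol" starts a new block for that ticker, the leading dashes in the listing are not part of the data, and a MultiIndex of timestamp and ticker is the shape wanted. The name ticker is made up for illustration.

```python
import csv
import io

import pandas as pd

# Sample rows modelled on the question (leading dashes dropped, by assumption).
raw = """timestamp,symbol,value
1583353249601,symbol,SH
1583353249601,delayed,False
1583353249601,asset-main-type,EQUITY
1583353250614,symbol,SH
1583353250614,last-price,24.7952
1583353250614,bid-size,362
1583353250614,symbol,VOO
1583353250614,bid-price,284.79
1583353250614,bid-size,3
"""

records = {}
ticker = None
for row in csv.DictReader(io.StringIO(raw)):
    if row["symbol"] == "symbol":
        ticker = row["value"]  # a "symbol" row opens a new block
        continue
    if ticker is None:
        continue  # field seen before any ticker was announced
    # Key on (timestamp, ticker) so tickers sharing a timestamp stay separate.
    records.setdefault((row["timestamp"], ticker), {})[row["symbol"]] = row["value"]

index = pd.MultiIndex.from_tuples(list(records), names=["timestamp", "ticker"])
df = pd.DataFrame(list(records.values()), index=index)
print(df)
```

All values stay as strings here; converting columns with pd.to_numeric afterwards would be a separate step.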

How to ignore some commas not inside quotes when using pandas.read_csv?

ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged
1009317190,French Cuisine, A Traditional Experience,Cookbooks,Food,USD,2014-09-08 00:46:23,13730,2014-08-09 03:16:02,3984,failed,46,US,3984
I used pandas.read_csv() to load the csv file above into a dataframe. However, my output came out like this:
Question: How can I ignore the comma between French Cuisine and A Traditional Experience, and read them into the same column?
You can follow these steps to achieve what you want:
Step 1: join the two halves of the split name.
df['name'] = df['name'] + df['category']
Step 2: shift the remaining columns one place to the left.
data1 = df.iloc[:, :2] # dataframe with columns 'ID' and 'name'
data2 = df.iloc[:, 2:].T.shift(-1, axis=0).T # Shifting multi-column data to the left
data = pd.concat([data1, data2], axis=1) # concat dataframes data1 and data2 along columns
Step 3: drop the leftover last column.
data = data.drop('Unnamed:13', axis=1) # drop column named 'Unnamed:13' (axis must be passed by keyword in pandas 2.0+)
Just open the CSV file as a text file and replace French Cuisine, A Traditional Experience with French Cuisine A Traditional Experience.
with open("example.csv", 'r') as f:
    csv_file = f.read()
csv_file = csv_file.replace("French Cuisine, A Traditional Experience", "French Cuisine A Traditional Experience")
with open("example.csv", 'w') as f:
    f.write(csv_file)
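Both answers above are tied to this one row. A more general sketch is to parse the file with the csv module and, whenever a row has more fields than the header, glue the overflow back into the one column that is allowed to contain commas. This assumes that only a single, known column (here name, at position 1) can contain unquoted commas; the function name read_csv_with_loose_commas is hypothetical.

```python
import csv
import io

import pandas as pd


def read_csv_with_loose_commas(text, loose_col=1):
    """Parse CSV text where one column may contain unquoted commas."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        extra = len(fields) - len(header)
        if extra > 0:
            # Glue the overflow fields back into the loose column.
            merged = ",".join(fields[loose_col:loose_col + extra + 1])
            fields = fields[:loose_col] + [merged] + fields[loose_col + extra + 1:]
        rows.append(fields)
    return pd.DataFrame(rows, columns=header)


sample = (
    "ID,name,category,main_category,currency,deadline,goal,launched,"
    "pledged,state,backers,country,usd pledged\n"
    "1009317190,French Cuisine, A Traditional Experience,Cookbooks,Food,USD,"
    "2014-09-08 00:46:23,13730,2014-08-09 03:16:02,3984,failed,46,US,3984\n"
)
df = read_csv_with_loose_commas(sample)
print(df[["ID", "name", "category"]])
```

Every column comes back as text with this approach, so numeric and date columns would still need converting afterwards, for example with pd.to_numeric or pd.to_datetime.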

Iterating through a csv file and creating a table

I'm trying to read in a .csv file and extract specific columns so that I can output a single table that essentially performs a 'GROUP BY' on a particular column and aggregates certain other columns of interest (similar to how you would in SQL) but I'm not too familiar how to do this easily in Python.
The csv file is in the following form:
age,education,balance,approved
30,primary,1850,yes
54,secondary,800,no
24,tertiary,240,yes
I've tried to import and read the csv file, parse the three columns I care about, and iterate through them to put them into three separate lists. I'm not too familiar with packages, or with how to get these into a data frame or matrix with 3 columns so that I can then iterate through them and compute all of the aggregated output fields (see the expected results below).
import csv

with open('loans.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    next(readCSV)  # skips header row
    education = []
    balance = []
    loan_approved = []
    for row in readCSV:
        educat = row[1]
        bal = row[2]
        approve = row[3]
        education.append(educat)
        balance.append(bal)
        loan_approved.append(approve)
print(education)
print(balance)
print(loan_approved)
The output would be a 4x7 table of four rows (grouped by education level) and the following headers:
Education|#Applicants|Min Bal|Max Bal|#Approved|#Rejected|%Apps Approved
Primary ...
Secondary ...
Tertiary ...
It is much simpler with pandas. For instance, you can read only the columns you care about instead of all of them:
import pandas as pd
df = pd.read_csv('loans.csv', usecols=['education', 'balance', 'approved'])
Now, to group by education level, you can find all the unique entries in that column and collect the matching rows for each:
groupby_education = {}
for level in df['education'].unique():
    groupby_education[level] = df.loc[df['education'] == level]
print(groupby_education)
I hope this helped. Let me know if you still need help.
Cheers!
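For the summary table the question asks for, pandas' own groupby with named aggregation may be closer to the SQL 'GROUP BY' being described. The following is a sketch on the three sample rows from the question, assuming "yes" in the approved column means the loan was approved; the output column names are made up for illustration.

```python
import io

import pandas as pd

# The three sample rows from the question, inlined so the example is self-contained.
csv_text = """age,education,balance,approved
30,primary,1850,yes
54,secondary,800,no
24,tertiary,240,yes
"""
df = pd.read_csv(io.StringIO(csv_text), usecols=["education", "balance", "approved"])

# Boolean helper column: True where the application was approved (assumed "yes").
df["is_approved"] = df["approved"].eq("yes")

summary = df.groupby("education").agg(
    applicants=("balance", "size"),
    min_bal=("balance", "min"),
    max_bal=("balance", "max"),
    approved=("is_approved", "sum"),
)
summary["rejected"] = summary["applicants"] - summary["approved"]
summary["pct_approved"] = 100 * summary["approved"] / summary["applicants"]
print(summary)
```

With a real file, io.StringIO(csv_text) would be replaced by the path to the csv.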

How to store this JSON file in a Pandas data frame?

I have never worked with JSON files before. I have this News Classification dataset. I wanted to get this in a Pandas dataframe.
It looks like this:
{"content": "Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.","annotation":{"notes":"","label":["Business"]},"extras":null,"metadata":{"first_done_at":1521027375000,"last_updated_at":1521027375000,"sec_taken":0,"last_updated_by":"nlYZXxNBQefF2u9VX52CdONFp0C3","status":"done","evaluation":"NONE"}}
{"content": "SPACE.com - TORONTO, Canada -- A second\\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\\privately funded suborbital space flight, has officially announced the first\\launch date for its manned rocket.","annotation":{"notes":"","label":["SciTech"]},"extras":null,"metadata":{"first_done_at":1521027375000,"last_updated_at":1521027375000,"sec_taken":0,"last_updated_by":"nlYZXxNBQefF2u9VX52CdONFp0C3","status":"done","evaluation":"NONE"}}
There are more entries, but I have posted just two of them. Each entry is enclosed in {} and has 4 keys: 'content', 'annotation', 'extras', 'metadata'. I would like to have this in a dataframe with those keys as columns.
I tried the json library and the pandas.read_json function, but both gave me errors.
with open('News-Classification-DataSet.json') as data_file:
    df = json.load(data_file)
This gave an error: JSONDecodeError: Extra data: line 2 column 1 (char 378)
I believe you have to read this file line by line: as a whole it isn't valid JSON, because each line is a separate JSON object (which is why json.load reports "Extra data" at line 2).
So to read that in:
import json
data = []
with open('News-Classification-DataSet.json') as f:
    for line in f:
        data.append(json.loads(line))
Now you should be able to work with that. However, what do you want as your dataframe output?
If you want to go straight to a dataframe, you can do as suggested:
import pandas as pd
df = pd.read_json("News-Classification-DataSet.json", lines=True)
But you have nested columns, and I don't know how you want to deal with those.
To load line-delimited JSON into a dataframe:
import pandas as pd
df = pd.read_json("News-Classification-DataSet.json", lines=True)
To expand the dicts inside columns into columns of their own:
pd.concat(
    [
        df["annotation"].apply(pd.Series),
        df[["content", "extras"]],
        df["metadata"].apply(pd.Series),
    ],
    axis=1,
)
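Another option for the nested columns is pd.json_normalize, which flattens nested dicts into dotted column names in one call. This is a sketch on two shortened, made-up records with the same shape as the question's data; with the real file, the records would come from reading it line by line with json.loads. The extra label column is an illustration, not something the question asked for.

```python
import json

import pandas as pd

# Two shortened stand-in records with the same nesting as the dataset.
lines = [
    '{"content": "First story.", "annotation": {"notes": "", "label": ["Business"]}, '
    '"extras": null, "metadata": {"sec_taken": 0, "status": "done"}}',
    '{"content": "Second story.", "annotation": {"notes": "", "label": ["SciTech"]}, '
    '"extras": null, "metadata": {"sec_taken": 0, "status": "done"}}',
]
records = [json.loads(line) for line in lines]

# Nested dicts become columns such as "annotation.label" and "metadata.status".
flat = pd.json_normalize(records)

# Each label is a one-element list; take its first item as a plain string.
flat["label"] = flat["annotation.label"].str[0]
print(flat.columns.tolist())
```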
