I’m currently doing some big data work. I have an issue in a .CSV where I need to split a multi-line chunk of text held in a single cell into individual cells. The table below shows the desired output. Currently, all of the 'ingredients' are in the same cell, with each ingredient on its own new line (Stack Overflow wouldn't allow me to create new lines in the same cell).
I need to write a script to split this single cell of ingredients into the output below, using each new line in the cell as a delimiter. The real use case I'm using this for is much more complex - over 200 'items', and anywhere between 50-150 'ingredients' per 'item'. I'm currently doing this manually in Excel with a series of text-to-columns and transpose pastes, but it takes approximately 2-2.5 full work days to do.
Link to data
Code below
Item    Ingredients
Coffee  Coffee beans
        Milk
        Sugar
        Water
import pandas as pd
df = pd.read_csv(r'd:\Python\menu.csv', delimiter=';', header=None)
headers = ["Item", "Ingredients"]
df.columns = headers
df["Ingredients"]=df["Ingredients"].str.split("\n")
df = df.explode("Ingredients").reset_index(drop=True)
df.to_csv(r"D:\Python\output.csv")
Using your code and the linked data, change the delimiter to a comma, as shown below.
import pandas as pd
df = pd.read_csv('Inventory.csv', delimiter=',')
df["Software"]=df["Software"].str.split("\n")
df = df.explode("Software").reset_index(drop=True)
# Remove rows having empty string under Software column.
df = df[df['Software'].astype(bool)]
df = df.reset_index(drop=True)
df.to_csv("out_Inventory.csv")
print(df.to_string())
Output
Hostname Software
0 ServerName1 Windows Driver Package - Amazon Inc. (AWSNVMe) SCSIAdapter (08/27/2019 1.3.2.53) [version 08/27/2019 1.3.2.53]
1 ServerName1 Airlock Digital Client [version 4.7.1.0]
2 ServerName1 AppFabric 1.1 for Windows Server [version 1.1.2106.32]
3 ServerName1 BlueStripe Collector [version 8.0.3]
...
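To try the same split-and-explode steps without the real files, here is a minimal sketch on an in-memory sample. The sample text is hypothetical: it assumes a comma-delimited file with a header row and one quoted cell that holds the four ingredients from the question on separate lines.

```python
import io
import pandas as pd

# Hypothetical sample: one quoted cell holding four newline-separated ingredients.
raw = 'Item,Ingredients\nCoffee,"Coffee beans\nMilk\nSugar\nWater"\n'

df = pd.read_csv(io.StringIO(raw))

# Turn the multi-line cell into a list, then give each list element its own row.
df["Ingredients"] = df["Ingredients"].str.split("\n")
df = df.explode("Ingredients").reset_index(drop=True)

print(df)
```

Note that explode repeats the Item value on every resulting row, so "Coffee" appears next to each ingredient rather than only on the first row.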
Here's how to do it with Python's standard csv module:
import csv
# Open both files in one with-block so they are closed when the work is done
with open('input.csv', newline='') as f_in, open('output.csv', 'w', newline='') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    writer.writerow(next(reader))  # copy header
    for row in reader:
        item = row[0]
        ingredients = row[1].split('\n')
        first_ingredient = ingredients[0]
        writer.writerow([item, first_ingredient])
        for ingredient in ingredients[1:]:
            writer.writerow([None, ingredient])  # None for a blank cell (under the item)
Given your small sample, I get this:
Item    Ingredients
Coffee  Coffee beans
        Milk
        Sugar
        Water
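The reason split('\n') works here is that the csv module returns a quoted multi-line cell as a single field with the line breaks still inside it. A small sketch of that, using a hypothetical in-memory sample rather than the real input.csv:

```python
import csv
import io

# Hypothetical sample: the Ingredients cell is quoted and spans four lines.
raw = 'Item,Ingredients\nCoffee,"Coffee beans\nMilk\nSugar\nWater"\n'

reader = csv.reader(io.StringIO(raw))
header = next(reader)
rows = list(reader)

# The whole cell arrives as one string; the newlines are part of the field.
cell = rows[0][1]
print(repr(cell))
print(cell.split('\n'))
```

If the cells were saved with Windows line endings, cell.splitlines() may be the safer choice, since it handles both '\n' and '\r\n'.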
Related
I'm new to Python (and posting on SO), and I'm trying to use some code I wrote that worked in another similar context to import data from a file into a MySQL table. To do that, I need to convert it to a dataframe. In this particular instance I'm using Federal Election Commission data that is pipe-delimited (it's the "Committee Master" data here). It looks like this.
C00000059|HALLMARK CARDS PAC|SARAH MOE|2501 MCGEE|MD #500|KANSAS CITY|MO|64108|U|Q|UNK|M|C||
C00000422|AMERICAN MEDICAL ASSOCIATION POLITICAL ACTION COMMITTEE|WALKER, KEVIN MR.|25 MASSACHUSETTS AVE, NW|SUITE 600|WASHINGTON|DC|200017400|B|Q||M|M|ALABAMA MEDICAL PAC|
C00000489|D R I V E POLITICAL FUND CHAPTER 886|JERRY SIMS JR|3528 W RENO||OKLAHOMA CITY|OK|73107|U|N||Q|L||
C00000547|KANSAS MEDICAL SOCIETY POLITICAL ACTION COMMITTEE|JERRY SLAUGHTER|623 SW 10TH AVE||TOPEKA|KS|666121627|U|Q|UNK|Q|M|KANSAS MEDICAL SOCIETY|
C00000729|AMERICAN DENTAL ASSOCIATION POLITICAL ACTION COMMITTEE|DI VINCENZO, GIORGIO T. DR.|1111 14TH STREET, NW|SUITE 1100|WASHINGTON|DC|200055627|B|Q|UNK|M|M|INDIANA DENTAL PAC|
When I run this code, all of the records come back "NaN."
import pandas as pd
import pymysql
print('convert CSV to dataframe')
data = pd.read_csv ('Desktop/Python/FECupdates/cm.txt', delimiter='|')
df = pd.DataFrame(data, columns=['CMTE_ID','CMTE_NM','TRES_NM','CMTE_ST1','CMTE_ST2','CMTE_CITY','CMTE_ST','CMTE_ZIP','CMTE_DSGN','CMTE_TP','CMTE_PTY_AFFILIATION','CMTE_FILING_FREQ','ORG_TP','CONNECTED_ORG_NM','CAND_ID'])
print(df.head(10))
If I remove the dataframe part and just do this, it displays the data, so it doesn't seem like it's a problem with the file itself (but what do I know?):
import pandas as pd
import pymysql
print('convert CSV to dataframe')
data = pd.read_csv ('Desktop/Python/FECupdates/cm.txt', delimiter='|')
print(data.head(10))
I've spent hours looking at different questions here that seem to be trying to address similar issues -- in which cases the problems apparently stemmed from things like the encoding or different kinds of delimiters -- but each time I try to make the same changes to my code I get the same result. I've also converted the whole thing to a CSV, by changing all the commas in fields to "$" and then changing the pipes to commas. It still shows up as all "NaN," even though the number of records is correct if I upload it to MySQL (they're just all empty).
You made typos in the columns list. Pandas can recognize the columns automatically.
import pandas as pd
import pymysql
print('convert CSV to dataframe')
data = pd.read_csv ('cn.txt', delimiter='|')
df = pd.DataFrame(data)
print(df.head(10))
Also, you can create an empty dataframe and concatenate it with the file you read.
import pandas as pd
import pymysql
print('convert CSV to dataframe')
data = pd.read_csv ('cn.txt', delimiter='|')
data2 = pd.DataFrame()
df = pd.concat([data,data2],ignore_index=True)
print(df.head(10))
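Another thing worth checking, judging from the sample rows in the question: the file appears to have no header row. In that case read_csv uses the first record as the column names, and pd.DataFrame(data, columns=[...]) then asks for labels that do not exist, which produces columns full of NaN. A minimal sketch under that assumption passes the names straight to read_csv. The two sample rows are copied from the question; dtype=str is an extra assumption, added so ZIP codes stay as text.

```python
import io
import pandas as pd

columns = ['CMTE_ID', 'CMTE_NM', 'TRES_NM', 'CMTE_ST1', 'CMTE_ST2', 'CMTE_CITY',
           'CMTE_ST', 'CMTE_ZIP', 'CMTE_DSGN', 'CMTE_TP', 'CMTE_PTY_AFFILIATION',
           'CMTE_FILING_FREQ', 'ORG_TP', 'CONNECTED_ORG_NM', 'CAND_ID']

# Two records copied from the question, standing in for cm.txt.
sample = (
    "C00000059|HALLMARK CARDS PAC|SARAH MOE|2501 MCGEE|MD #500|KANSAS CITY|MO|64108|U|Q|UNK|M|C||\n"
    "C00000489|D R I V E POLITICAL FUND CHAPTER 886|JERRY SIMS JR|3528 W RENO||OKLAHOMA CITY|OK|73107|U|N||Q|L||\n"
)

# header=None keeps the first record as data; names= supplies the labels.
df = pd.read_csv(io.StringIO(sample), sep='|', header=None, names=columns, dtype=str)

print(df[['CMTE_ID', 'CMTE_NM', 'CMTE_ZIP']])
```

With the real file, the io.StringIO(sample) argument would be replaced by the path to cm.txt.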
Try this, worked for me:
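import pandas as pd
path = 'Desktop/Python/FECupdates/'
# header=None: the sample in the question has no header row, so keep the first record as data
df = pd.read_csv(path + 'cm.txt', encoding='unicode_escape', sep='|', header=None)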
df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
df.columns = ['CMTE_ID','CMTE_NM','TRES_NM','CMTE_ST1','CMTE_ST2','CMTE_CITY','CMTE_ST','CMTE_ZIP','CMTE_DSGN','CMTE_TP','CMTE_PTY_AFFILIATION','CMTE_FILING_FREQ','ORG_TP','CONNECTED_ORG_NM','CAND_ID']
df.head(200)
Output:
I used the JotForm Configurable List widget to collect data, but I'm having trouble parsing or reading the data, as the number of records is > 2K.
The configurable field name is Person Details and the list has these options to take as input:
Name, Gender, Date of Birth, Govt. ID, Covid Test, Covid Result, Type of Follow Up, Qualification, Medical History, Disabilities, Employment Status, Individual Requirement
A Snap of the excel file, Configurable List Submissions
I want the Excel or CSV sheet, which currently holds the data in one column as shown in the snap, to be exported into separate columns, with the list options mentioned above as the heading for each column.
I'm very new to Python, pandas and data parsing, and this is for a very important social benefit project to help people during this time of COVID crisis, so any help would be gladly appreciated :)
Having the labels repeated in each row isn't something the standard pandas tools like read_csv handle natively. I would iterate through the rows as text strings and build the dataframe one row at a time. We will do this by getting each line into the form pd.Series({"Column1": "data", "Column2": "data", ...}), and then building a dataframe out of a list of those objects.
import pandas as pd
## Sample data
data = ["Column1: Data1, Column2: Data2, Column3: Data3", "Column1: Data4, Column2: Data5, Column3: Data6"]
rows = []
## Iterate over rows
for line in data:
    ## Split along ", " so the labels carry no leading space
    split1 = line.split(', ')
    ## Split each "label: value" pair into [label, value]
    split2 = [s.split(': ') for s in split1]
    ## Now split2 for the first row looks like this:
    ## [['Column1', 'Data1'], ['Column2', 'Data2'], ['Column3', 'Data3']]
    ## Make a series
    row = pd.Series({item[0]: item[1] for item in split2})
    rows.append(row)
df = pd.DataFrame(rows)
Now df looks like this:
Column1 Column2 Column3
0 Data1 Data2 Data3
1 Data4 Data5 Data6
You can save it in this format with df.to_csv("filename.csv") and open it in tools like Excel.
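If some values themselves contain commas, a plain split on commas would cut them apart. One possible workaround, sketched here under the assumption that every line follows the "Label: value, Label: value" pattern and that the labels are known in advance, is to split only on commas that are directly followed by a known label. The labels below are a subset of the fields listed in the question; the sample lines and the parse_line helper are made up for illustration.

```python
import re
import pandas as pd

# Subset of the field labels from the question.
labels = ["Name", "Gender", "Date of Birth"]

# Split on a comma only when a known label and a colon follow it.
pattern = re.compile(r",\s*(?=(?:%s):)" % "|".join(map(re.escape, labels)))

def parse_line(line):
    """Turn one 'Label: value, Label: value' line into a dict (hypothetical helper)."""
    record = {}
    for part in pattern.split(line):
        label, _, value = part.partition(":")
        record[label.strip()] = value.strip()
    return record

# Made-up sample lines; the first value contains a comma.
lines = [
    "Name: Doe, Jane, Gender: Female, Date of Birth: 01/02/1990",
    "Name: John Roe, Gender: Male, Date of Birth: 03/04/1985",
]

df = pd.DataFrame([parse_line(line) for line in lines], columns=labels)
print(df)
```

Building the frame from a list of dicts also avoids creating one pd.Series per row, which may matter with a few thousand records.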
Thanks in advance! I have been struggling for a few days, so that means it is time for me to ask a question. I have a program that is pulling information for three stocks using the module "yfinance". It uses a ticker list in a txt file. I can get the intended information into a data frame for each ticker in the list using a for loop. I then want to save the information for each separate ticker on its own sheet in an Excel book, with the sheet name being the ticker. As of now I end up creating three distinct data frames, but the Excel output only has one tab with the last requested ticker information (MSFT). I think I may need to use an append process to create a new tab with each data frame's information. Thanks for any suggestions.
Code
import platform
import yfinance as yf
import pandas as pd
import csv
# check versions
print('Python Version: ' + platform.python_version())
print('YFinance Version: ' + yf.__version__)
# load txt of tickers to list, contains three tickers
tickerlist = []
with open('tickers.txt') as inputfile:
    for row in csv.reader(inputfile):
        tickerlist.append(row)
# iterate through ticker txt file
for i in range(len(tickerlist)):
    tickersymbol = tickerlist[i]
    stringticker = str(tickersymbol)
    stringticker = stringticker.replace("[", "")
    stringticker = stringticker.replace("]", "")
    stringticker = stringticker.replace("'", "")
    # set data to retrievable variable
    tickerdata = yf.Ticker(stringticker)
    tickerinfo = tickerdata.info
    # data items requested
    investment = tickerinfo['shortName']
    country = tickerinfo['country']
    # create dataframes from lists
    dfoverview = pd.DataFrame({'Label': ['Company', 'Country'],
                               'Value': [investment, country]
                               })
    print(dfoverview)
    print('-----------------------------------------------------------------')
    #export data to each tab (PROBLEM AREA)
    dfoverview.to_excel('output.xlsx',
                        sheet_name=stringticker)
Output
Python Version: 3.7.7
YFinance Version: 0.1.54
Company Walmart Inc.
Country United States
Company Tesla, Inc.
Country United States
Company Microsoft Corporation
Country United States
Process finished with exit code 0
EDITS: Deleted original to try and post to correct forum/location
If all of your ticker information is in a single data frame, the pandas groupby() method works well for you here (if I'm understanding your problem correctly). This is pseudocode, but try something like this instead:
import pandas as pd
# df here represents your single data frame with all your ticker info
# column_value is the column you choose to group by
# this column_value will also be used to dynamically create your sheet names
ticker_group = df.groupby('column_value')
# create the writer obj
with pd.ExcelWriter('output.xlsx') as writer:
    # key=value of column_value, data=dataframe obj of data pertaining to key
    for key, data in ticker_group:
        # str(key) because sheet names must be strings
        data.to_excel(writer, sheet_name=str(key), index=False)
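If you would rather keep one data frame per ticker, as in the question, the same ExcelWriter idea applies: calling to_excel with a file name rewrites the whole workbook each time, so only the last sheet survives, whereas one writer opened once can receive every sheet. A rough sketch follows. The build_overview and write_sheets helpers are hypothetical names, the company values are copied from the question's printed output, and the WMT and TSLA symbols are assumed (only MSFT is named in the question).

```python
import pandas as pd

def build_overview(company, country):
    """Build the two-row Label/Value frame used in the question."""
    return pd.DataFrame({'Label': ['Company', 'Country'],
                         'Value': [company, country]})

def write_sheets(frames, path):
    """Write each frame in `frames` (sheet name -> DataFrame) to its own sheet."""
    # One writer for the whole workbook, so earlier sheets are not overwritten.
    with pd.ExcelWriter(path) as writer:
        for sheet_name, frame in frames.items():
            frame.to_excel(writer, sheet_name=sheet_name, index=False)

# Values copied from the question's output; WMT and TSLA symbols are assumed.
frames = {
    'WMT': build_overview('Walmart Inc.', 'United States'),
    'TSLA': build_overview('Tesla, Inc.', 'United States'),
    'MSFT': build_overview('Microsoft Corporation', 'United States'),
}

# Uncomment to write the file; this needs an Excel engine such as openpyxl installed.
# write_sheets(frames, 'output.xlsx')
```

In the original loop, this would mean collecting each dfoverview into the dict under its ticker and writing once after the loop ends.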
I have a txt file with info inside of it, separated for every deal with \n symbol.
DEAL: 896
CITY: New York
MARKET: Manhattan
PRICE: $9,750,000
ASSET TYPE: Rental Building
SF: 8,004
PPSF: $1,218
DATE: 11/01/2017
Is there any way to make a CSV (or another) table with headers such as CITY, MARKET, etc. with pandas or the csv module? All the info under a specific title should go into the corresponding header.
Updated to navigate around using : as a delimiter:
import pandas as pd
new_temp = open('temp.txt', 'w') # writing data to a new file changing the first delimiter only
with open('fun.txt') as f:
    for line in f:
        line = line.replace(':', '|', 1) # only replace first instance of : to use this as delimiter for pandas
        new_temp.write(line)
new_temp.close()
df = pd.read_csv('temp.txt', delimiter='|', header=None)
df = df.set_index([0]).T
df.to_csv('./new_transposed_df.csv', index=False)
This will make a CSV with the left column as headers and the right column as data, without changing colons in the second column. It writes out a temp file called temp.txt, which you can remove after you run the program.
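If the file holds several deals one after another, a single transpose would put all of them in one row. A possible variation, sketched under the assumption that every record starts with a DEAL line, collects each deal into a dict and builds the frame at the end. The parse_deals helper is a hypothetical name; the first record is the one from the question and the second is made up for illustration.

```python
import pandas as pd

def parse_deals(text):
    """Parse 'KEY: value' lines into one row per deal (assumes DEAL starts a record)."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        # Split on the first colon only, so colons inside values are kept.
        key, sep, value = line.partition(':')
        if not sep:
            continue  # not a KEY: value line
        key = key.strip()
        if key == 'DEAL' and current:
            records.append(current)
            current = {}
        current[key] = value.strip()
    if current:
        records.append(current)
    return pd.DataFrame(records)

# First record from the question; the second record is invented sample data.
text = """DEAL: 896
CITY: New York
MARKET: Manhattan
PRICE: $9,750,000
ASSET TYPE: Rental Building
SF: 8,004
PPSF: $1,218
DATE: 11/01/2017

DEAL: 897
CITY: New York
MARKET: Brooklyn
PRICE: $1,000,000
ASSET TYPE: Office
SF: 2,000
PPSF: $500
DATE: 12/01/2017
"""

df = parse_deals(text)
print(df)
```

The resulting frame can then be saved with df.to_csv('deals.csv', index=False), where deals.csv is just a placeholder name.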
Use Pandas to input it and then transform/pivot your table.
import pandas as pd
df = pd.read_csv('data.txt',sep=':',header=None)
df = df.set_index(0).T
Example
import pandas as pd
from io import StringIO
data = '''
DEAL: 896
CITY: New York
MARKET: Manhattan
PRICE: $9,750,000
ASSET TYPE: Rental Building
SF: 8,004
PPSF: $1,218
DATE: 11/01/2017
'''
# io.StringIO is used here because newer pandas versions no longer provide pd.compat.StringIO
df = pd.read_csv(StringIO(data),sep=':',header=None)
print(df.set_index(0).T)
Results:
I have a small issue while trying to parse some data from a table. My program reads a row of the table and then puts it in a list of strings (Python does this by default with the reader.next() function). Everything is fine until there are commas inside the text of a single table cell. In this case, the program thinks the comma is a separator and makes 2 list items instead of one, and this makes things like list[0].split(';') impossible.
I suck at explaining verbally, so let me illustrate:
csv_file = | House floors | Wooden, metal and golden | 2000 | # Illustration of an excel table
reader = csv.reader(open('csv_file.csv', 'r'))
row = reader.next() # row: ['House floors;Wooden', 'metal and golden; 2000']
columns = row.split(';') # columns: ['House floors, Wooden', 'metal and golden', '2000']
# But obviously what i want is this:
# columns : ['House floors', 'Wooden, metal and golden', '2000']
Thank you very much for your help!
Set the delimiter; see http://docs.python.org/2/library/csv.html
csv.reader(fh, delimiter='|')
You need to set correct delimiter which in your case would be | or ; (not clear from OP's example) e.g.
csv.reader(csvfile, delimiter=';')
Assuming you have data like "House floors;Wooden, metal and golden;2000", you can easily parse it using the csv module:
import csv
import StringIO
data = "House floors;Wooden, metal and golden;2000"
csvfile = StringIO.StringIO(data)
for row in csv.reader(csvfile, delimiter=';'):
    print row
output:
['House floors', 'Wooden, metal and golden', '2000']
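The snippets above are written for Python 2 (import StringIO, print row). For reference, a Python 3 equivalent of the same idea would look roughly like this, with io.StringIO standing in for the file and ';' passed as the delimiter:

```python
import csv
import io

data = "House floors;Wooden, metal and golden;2000"

# With delimiter=';', the commas inside the second field are left alone.
rows = list(csv.reader(io.StringIO(data), delimiter=';'))

for row in rows:
    print(row)  # ['House floors', 'Wooden, metal and golden', '2000']
```

When reading from a real file in Python 3, the csv documentation recommends opening it with newline='', and next(reader) takes the place of reader.next().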