Import data from text file with multiple conditions using Pandas - python

I'm trying to parse this text file using Pandas data frame.
The text file is in this particular format:
Name: Tom
Gender: Male
Books:
The problem of Pain
The reason for God: belief in an age of skepticism
My code so far to import the data is:
import pandas as pd
df = pd.read_table(filename, sep=":|\n", engine='python', index_col=0)
print(df)
The output I got is:
Name Tom
Gender Male
Books NaN
The problem of Pain NaN
The reason for God belief in an age of skepticism
How should I change the code such that the output I get will be: (edited output)
Name Gender Books
Tom Male The problem of Pain, The reason for God: belief in an age of skepticism
Thanks for helping!

You can do two things. You can use enumerate() together with an if statement; I used a text file named test.txt in the code below.
import pandas as pd
d = {}
value_list = []
for index, text in enumerate(open('test.txt', "r")):
    if index < 2:
        d[text.split(':')[0]] = text.split(':')[1].rstrip('\n')
    elif index == 2:
        value = text.split(':')[0]
    else:
        value_list.append(text.rstrip('\n'))
d[value] = [value_list]
df = pd.DataFrame(d)
Alternatively, you can use readlines() and slice the lines to populate the dictionary, then create a DataFrame.
import pandas as pd
text_file = open('test.txt', "r")
lines = text_file.readlines()
d = {}
d[lines[0].split(':')[0]] = lines[0].split(':')[1].rstrip('\n')
d[lines[1].split(':')[0]] = lines[1].split(':')[1].rstrip('\n')
d[lines[2].split(':')[0]] = [[line.rstrip('\n') for line in lines[3:]]]
df = pd.DataFrame(d)
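If you want the output shaped exactly as asked, with the books joined into a single comma-separated string, a minimal sketch along the same lines (assuming the same test.txt layout as above) could be:
import pandas as pd

with open('test.txt', 'r') as f:
    lines = [line.rstrip('\n') for line in f]

d = {}
d['Name'] = lines[0].split(':', 1)[1].strip()
d['Gender'] = lines[1].split(':', 1)[1].strip()
d['Books'] = ', '.join(line.strip() for line in lines[3:])   # join the book titles with ", "

df = pd.DataFrame([d])   # one row: Name, Gender, Books
print(df)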

The method I use is simple: regex.
import os, re
import pandas as pd

PROFILES = "./profiles/"  # directory containing the profile .txt files (adjust to your own path)

# List all the files in the directory that end with .txt
files = [file for file in os.listdir(PROFILES) if file.endswith(".txt")]

HEADERS = ['Name', 'Gender', 'Books']
DATA = []  # create the empty list to store the profiles

for file in files:                   # iterate over each file
    filename = PROFILES + file       # full path name of the data file
    text_file = open(filename, "r")  # open the file
    lines = text_file.read()         # read the whole file into memory
    text_file.close()                # close the file
    ###############################################################
    # Regex to pull out all the column headers and row data.  ####
    # Odd group numbers == headers, even group numbers == data ###
    ###############################################################
    books = re.search(r"(Name):(.*)\n+(Gender):(.*)\n+(Books):((?<=Books:)\D+)", lines)
    # append the data (even-numbered groups) to the DATA list
    DATA.append([books.group(i).strip() for i in range(len(books.groups()) + 1) if not i % 2 and i != 0])

profilesDF = pd.DataFrame(DATA, columns=HEADERS)  # create the dataframe
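If you also want each Books cell as a single comma-separated string (as in the desired output of the question) rather than newline-separated text, an optional post-processing step could be:
# split the newline-separated titles and rejoin them with ", "
profilesDF['Books'] = profilesDF['Books'].str.split('\n').str.join(', ')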

Related

Create a csv file using python where all the values after the first space go into one column

I need help converting the simple_line.txt file to a csv file using the pandas library. However, I cannot get the split right: I want everything after the first space on each line to go into a single column.
Here is the file (sample_list.txt), listed row by row:
Image Label
doc_pres223.jpg Durasal
doc_pres224.jpg Tab Cefepime
doc_pres225.jpg Tab Bleomycin
doc_pres226.jpg Budesonide is a corticosteroid,
doc_pres227.jpg prescribed for inflammatory,
I want the csv file to be like-
(image of the desired csv: an Image column and a Label column)
txt_file = r"./example.txt"
csv_file = r"./example.csv"
separator = "; "
with open(txt_file) as f_in, open(csv_file, "w+") as f_out:
    for line in f_in:
        f_out.write(separator.join(line.split(" ", maxsplit=1)))
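If you then want to load the semicolon-separated file back into pandas for further processing (an optional step, reusing the csv_file path from the snippet above), something along these lines should work:
import pandas as pd

# "; " is a multi-character separator, so the python engine is required
df = pd.read_csv(csv_file, sep="; ", engine="python")
print(df.head())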
try this:
import pandas as pd

def write_file(filename, output):
    rows = []
    lines = open(filename, 'r').readlines()
    for l in range(1, len(lines)):          # skip the header line
        line = lines[l]
        arr = line.split(" ", maxsplit=1)   # split only on the first space
        image_line = arr[0]
        label_line = arr[1].replace('\n', '')
        rows.append({'Image': image_line, 'Label': label_line})
    # DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
    df = pd.DataFrame(rows)
    df.to_csv(output)

if __name__ == '__main__':
    write_file('example.txt', 'example.csv')
If the filenames in the Image column are always the same length, then you could just treat it as a fixed-width file. So the first column would be 15 characters, and the rest is the second column. Then just add two empty columns and write it to a new file.
# libraries
import pandas as pd
# set filename
filename = "simple_line.txt"
# read as fixed width
df = pd.read_fwf(filename, header=0, widths=[15, 100])
# add 2 empty columns
df.insert(1, 'empty1', '')
df.insert(2, 'empty2', '')
# save as a new csv file
filenew = "output.csv"
df.to_csv(filenew, sep=';', header=True, index=False)

Insert a line between existing lines in a CSV file using python

I am creating a script that writes lines to a CSV file using Python.
For now, my script writes the CSV in this format:
Title row
Value1;Value2;.... (more than 70)
Title row2
Value1;Value2;...
I just want to be able to read the file again and insert a line of values in between rows, like the following:
Title row
Value1;Value2;.... (more than 70)
Value1;Value2;....
Title row2
Value1;Value2;...
Do you have any ideas?
import csv
with open('csvfile.csv', mode='w') as csv_file:
    fieldnames = ['Title', 'row']
    writer = csv.DictWriter(csv_file, delimiter=';', fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'Title': 'Value1', 'row': 'Value2'})
I think you can get the index of the row that holds the header titles, append the new rows, and then combine the two dataframes again. Here is code that might work for you.
import pandas as pd

# initialise data as dicts of lists
data = {'Title': ['Tom', 'nick', 'krish', 'jack', 'Title', 'Harry'],
        'Row':   [20, 21, 19, 18, 'Row', 21]}
new_data = {'Title': ['Rahib'],
            'Row':   [25]}

# create the DataFrames
df = pd.DataFrame(data)
new_df = pd.DataFrame(new_data)
#print(df)

# find the row that repeats the header titles
index = df[df['Title'] == 'Title'].index.values.astype(int)[0]
upper_df = df.loc[:index-1]
lower_df = df.loc[index+1:]

# insert the new row between the two halves (pd.concat replaces the removed DataFrame.append)
upper_df = pd.concat([upper_df, new_df])
upper_df = pd.concat([upper_df, lower_df]).reset_index(drop=True)
print(upper_df)
This will return the following dataframe.
Title Row
0 Tom 20
1 nick 21
2 krish 19
3 jack 18
4 Rahib 25
5 Harry 21
Thanks to #harshal's answer and #Rahib's answer, but I had already tried pandas for a long time and couldn't get it to work; it doesn't seem well suited to my CSV format.
Finally, I looked at the posts provided by #Serge Ballesta, and in fact a simple readlines() plus retrieval of the line index is a pretty simple trick:
with open(output) as myFile:
    for num, line in enumerate(myFile, 1):
        if lookup in line:
            index = num

f = open(output, "r")
contents = f.readlines()
value = ';'.join(value)
f.close()

contents.insert(index+1, str(value)+'\n')

f = open(output, "w")
contents = "".join(contents)
f.write(contents)
f.close()
Here output is the name of the file (passed in as a parameter), value is a list of values (joined into a string with ";" as the delimiter), and lookup is the string I was looking for (the title row).
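For example, the variables feeding the snippet above might look like this (hypothetical values, just for illustration):
output = 'csvfile.csv'                    # the CSV file to modify
lookup = 'Title row'                      # the title row to search for
value = ['Value1', 'Value2', 'Value3']    # the new row; the snippet joins it with ";"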

transform Text file to csv with new columns

I have a text file with a list of names and numbers.
Example of text file format:
james 500
Katrina 200
kyle 600 etc
I want to create a csv file from this text file with two columns, Name and Count, where Name holds the names and Count holds the numbers. Below is what I've tried so far:
import csv

class csvTest(object):
    def __init__(self):
        self.convertToCSV()

    def convertToCSV(self):
        names = []
        with open('BoysNames.txt', 'r') as b_names, open('popular_names.csv', 'w') as out_file, open('GirlsNames.txt', 'r') as g_names:
            for b_lines in b_names:
                b_lines = b_lines.strip().split('\t')
                names.append(b_lines)
            #for g_lines in g_names:
            #    g_lines = g_lines.split('\t')
            #    names.append(g_lines)
            writer = csv.writer(out_file)
            writer.writerow(('FirstName', 'Count'))
            writer.writerows(names)

if __name__ == '__main__':
    csvTest()
I'm not able to split the columns properly; everything goes into names. Please help.
How about this solution with pandas? First let's create some sample data:
import io
# first let's recreate your file
data1 = '''\
james 500
kyle 600'''
data2 ='''\
Katrina 200'''
file1 = io.StringIO(data1)
file2 = io.StringIO(data2)
Now we do the real operation:
import pandas as pd
# Let's put the files in a list
# (in reality this would be a list of paths: ["path/to/file1", "path/to/file2", ...])
files = [file1, file2]
# now let's read this data with pandas to a dataframe
names = ["FirstName","Count"]
df = pd.concat(pd.read_csv(f, sep=" ", header=None, names=names) for f in files)
# and output to csv:
df.to_csv("output.csv", sep=",", index=False)
Result 'output.csv':
FirstName,Count
james,500
kyle,600
Katrina,200

Not all data in a column is being copied to another csv file

So I have two csv files. One is in the following format:
last name, first name, Number
The other is in this format:
number, quiz
I want to create a new output file that takes these two csv files and gives me a file in the following format:
last name, first name, number, quiz.
I have tried the following code and it works, but only for the first person listed in the first two input files. I am not sure what I am doing wrong. Also, I do not want to assume that the two input files follow the same order.
import sys, re
import numpy as np
import smtplib
from random import randint
import csv
import math
col = sys.argv[1]
source = sys.argv[2]
target = sys.argv[3]
newtarg = sys.argv[4]
input_source = csv.DictReader(open(source))
input_target = csv.DictReader(open(target))

data = {}
t = ()
for row in input_target:
    t = row['First Name'], row['number']
    for rows in input_source:
        if rows['number'] == row['number']:
            t = t + (rows[col],)
            name = row['Last Name']
            data[name] = [t]
        rows.next()
    row.next()

with open(newtarg, 'w') as out:
    csv_out = csv.writer(out)
    for key, val in data.items():
        csv_out.writerow([key] + list(val))
This might be a job for pandas, the Python Data Analysis Library:
import pandas as pd
x1 = pd.read_csv('x1.csv')
x2 = pd.read_csv('x2.csv')
result = pd.merge(x1, x2, on='number')
result.to_csv('result.csv',
              index=False,
              columns=['Last Name', 'First Name', 'number', 'quiz'])
Reference: https://chrisalbon.com/python/pandas_join_merge_dataframe.html
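Note that pd.merge defaults to an inner join, so rows whose number has no match in the other file are dropped. If you want to keep every row from x1.csv even when there is no quiz entry, a left join is a small variation on the answer above:
# keep all rows from x1; unmatched rows get NaN in the quiz column
result = pd.merge(x1, x2, on='number', how='left')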
I think the following will work. Note: I've taken out all the stuff in the code in your question that's not being used (as you should have done before posting it). I've also hardcoded the input values for testing.
import csv
source = 'source1.csv'
target = 'target1.csv'
newtarg = 'new_output.csv'
targets = {}
with open(target) as file:
    for row in csv.DictReader(file):
        targets[row['number']] = row['quiz']

with open(source) as src, open(newtarg, 'w') as out:
    reader = csv.DictReader(src)
    writer = csv.writer(out)
    writer.writerow(reader.fieldnames + ['quiz'])  # create a header row (optional)
    for row in reader:
        row.update({'quiz': targets.get(row['Number'], 'no match')})
        writer.writerow(row.values())

Python: extracting data values from one file with IDs from a second file

I’m new to coding, and trying to extract a subset of data from a large file.
File_1 contains the data in two columns: ID and Values.
File_2 contains a large list of IDs, some of which may be present in File_1 while others will not be present.
If an ID from File_2 is present in File_1, I would like to extract those values and write the ID and value to a new file, but I’m not sure how to do this. Here is an example of the files:
File_1: data.csv
ID Values
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
File_2: ID.xlsx
IDs
HOT224_1_0025m_c100047_1
HOT224_1_0025m_c100061_1
HOT225_1_0025m_c100547_1
HOT225_1_0025m_c100561_1
I tried the following:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col = 0)
ID_file = pd.read_excel('ID.xlsx')
values_from_ID = data_file.loc[['ID_file']]
The following error occurs:
KeyError: "None of [['ID_file']] are in the [index]"
Not sure if I am reading in the excel file correctly.
I also do not know how to write the extracted data to a new file once I get the code to do it.
Thanks for your help.
With pandas:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')
Content of result.csv:
IDs,Values
HOT224_1_0025m_c100047_1,16.0
HOT224_1_0025m_c100061_1,1.0
In steps:
You need to read your csv file as whitespace-delimited:
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
it looks like this:
>>> data_file
Values
ID
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
Now, read your Excel file, using the ids as index:
ID_file = pd.read_excel('ID.xlsx', index_col=0)
and you use its index with loc to get the matching entries from your first dataframe. Drop the missing values with dropna():
res = data_file.loc[ID_file.index].dropna()
Finally, write to the result csv:
res.to_csv('result.csv')
You can do it using a simple dictionary in Python. You can make a dictionary from file 1 and read the IDs from File 2. The IDS from file 2 can be checked in the dictionary and only the matching ones can be written to your output file. Something like this could work :
with open('data.csv', 'r') as f:
    lines = f.readlines()

# Skip the CSV header
lines = lines[1:]
table = {l.split()[0]: l.split()[1] for l in lines if len(l.strip()) != 0}

with open('id.csv', 'r') as f:
    lines = f.readlines()

# Skip the CSV header
lines = lines[1:]
matchedIDs = [(l.strip(), table[l.strip()]) for l in lines if l.strip() in table]
Now you will have your matched IDs and their values in a list of tuples called matchedIDs. You can write them in any format you like in a file.
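For instance, a minimal sketch for writing the matched pairs out as a CSV (assuming an output name such as matched.csv) could be:
import csv

with open('matched.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['ID', 'Values'])   # header row
    writer.writerows(matchedIDs)        # each (id, value) tuple becomes one row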
I'm also new to Python programming, so the code below might not be the most efficient. The situation I assumed is that we want to find the IDs that appear in both data.csv and id.csv; there might be some IDs in data.csv that are not in id.csv and vice versa.
import pandas as pd

data = pd.read_csv('data.csv')
id2 = pd.read_csv('id.csv')

data.ID = data['ID']
id2.ID = id2['IDs']

d = []
for row in data.ID:
    d.append(row)

f = []
for row in id2.ID:
    f.append(row)

g = []
for i in d:
    if i in f:
        g.append(i)

data = pd.read_csv('data.csv', index_col='ID')
new_data = data.loc[g, :]
new_data.to_csv('new_data.csv')
This is the code I ended up using. It worked perfectly. Thanks to everyone for their responses.
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')
