appending multiple pandas DataFrames read in from files - python

Hello, I am trying to read in multiple files, build a DataFrame of the specific key information I need from each file, and then append each per-file DataFrame to a main DataFrame called topics. I have tried the following code.
import pandas as pd
import numpy as np
from lxml import etree
import os

topics = pd.DataFrame()
for filename in os.listdir('./topics'):
    if not filename.startswith('.'):
        tree = etree.parse('./topics/' + filename)
        root = tree.getroot()
        childA = []
        elementT = []
        ElementA = []
        for child in root:
            elementT.append(str(child.tag))
            ElementA.append(str(child.attrib))
            childA.append(str(child.attrib))
            for element in child:
                elementT.append(str(element.tag))
                ElementA.append(str(element.attrib))
                childA.append(str(child.attrib))
                for sub in element:
                    elementT.append(str(sub.tag))
                    ElementA.append(str(sub.attrib))
                    childA.append(str(child.attrib))
        df = pd.DataFrame()
        df['c'] = np.array(childA)
        df['t'] = np.array(ElementA)
        df['a'] = np.array(elementT)
        file = df['t'].str.extract(r'([A-Z][A-Z].*[words.xml])#')
        start = df['t'].str.extract(r'words([0-9]+)')
        stop = df['t'].str.extract(r'.*words([0-9]+)')
        tags = df['a'].str.extract(r'.*([topic]|[pointer]|[child])')
        rootTopic = df['c'].str.extract(r'rdhillon.(\d+)')
        df['f'] = file
        df['start'] = start
        df['stop'] = stop
        df['tags'] = tags  # c = topic, r = pointer, d = child
        df['topicID'] = rootTopic
        df = df.iloc[:, 3:]
        topics.append(df)
However, when I call topics I get the following output:
topics
Out[19]:
Empty DataFrame
Columns: []
Index: []
Can someone please let me know where I am going wrong? Also, any suggestions for improving my messy code would be appreciated.

Unlike lists, when you append to a DataFrame you get back a new object. So topics.append(df) returns a new DataFrame that you never store anywhere, and topics remains the empty DataFrame you declared before the loop. You can fix this with
topics = topics.append(df)
However, appending to a DataFrame within a loop is a very costly exercise, because each call copies all of the data accumulated so far. Instead, append each per-file DataFrame to a plain list within the loop and call pd.concat() on the list of DataFrames after the loop. (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is also the future-proof choice.)
import pandas as pd

topics_list = []
for filename in os.listdir('./topics'):
    # ... all of your per-file parsing code that builds df ...
    topics_list.append(df)  # lists are modified in place by append

# After the loop, one call to concat
topics = pd.concat(topics_list)
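As for the messy-code question: one way to tidy this up is to move the per-file parsing into a function that returns a small DataFrame, then build the list with a comprehension. The sketch below (parse_topic_file is a hypothetical name, not something from your code) only reproduces the three raw columns 'c', 't' and 'a'; your regex extraction steps would follow unchanged:

import os
import pandas as pd
from lxml import etree

def parse_topic_file(path):
    # Collect one (child attrib, own attrib, own tag) row per element
    # at each of the three nesting depths walked in the original code
    root = etree.parse(path).getroot()
    rows = []
    for child in root:
        rows.append((str(child.attrib), str(child.attrib), str(child.tag)))
        for element in child:
            rows.append((str(child.attrib), str(element.attrib), str(element.tag)))
            for sub in element:
                rows.append((str(child.attrib), str(sub.attrib), str(sub.tag)))
    return pd.DataFrame(rows, columns=['c', 't', 'a'])

frames = [parse_topic_file(os.path.join('./topics', f))
          for f in os.listdir('./topics') if not f.startswith('.')]
topics = pd.concat(frames, ignore_index=True)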

Related

Python pandas dataframe set_name length issue expecting one argument but is given two

So I have this prep_data function and I am giving it the following CSV data:
identifier,Hugo_Symbol,Tumor_Sample_Barcode,Variant_Classification,patient
1,patient,a,Silent,6
22,mutated,d,e,7
1,Hugo_Symbol,f,g,88
Within this prep_data function, there is this line:
gene_mutation_df.index.set_names(['Hugo_Symbol', 'patient'], inplace=True)
However, it keeps erroring out when it reaches that line, saying:
ValueError: Length of new names must be 1, got 2
Is there something wrong with this line, or is it something wrong with the function?
Here is the whole source code:
import pandas as pd
import numpy as np

PRIMARY_TUMOR_PATIENT_ID_REGEX = '^.{4}-.{2}-.{4}-01.*'
SHORTEN_PATIENT_REGEX = '^(.{4}-.{2}-.{4}).*'

def mutations_for_gene(df):
    mutated_patients = df['identifier'].unique()
    return pd.DataFrame({'mutated': np.ones(len(mutated_patients))}, index=mutated_patients)

def prep_data(mutation_path):
    df = pd.read_csv(mutation_path, low_memory=True, dtype=str, header=0)  # read the csv at the given path, casting all data to str
    df = df[~df['Hugo_Symbol'].str.contains('Hugo_Symbol')]  # drop any row whose 'Hugo_Symbol' value contains the literal text 'Hugo_Symbol'
    df['Hugo_Symbol'] = '\'' + df['Hugo_Symbol'].astype(str)  # prepend a quote character to every remaining value in that column
    df['Tumor_Sample_Barcode'] = df['Tumor_Sample_Barcode'].str.strip()  # strip whitespace from the barcodes
    non_silent = df.where(df['Variant_Classification'] != 'Silent')  # mask rows where 'Variant_Classification' equals 'Silent'
    df = non_silent.dropna(subset=['Variant_Classification'])  # drop the masked (now-NaN) rows
    non_01_barcodes = df[~df['Tumor_Sample_Barcode'].str.contains(PRIMARY_TUMOR_PATIENT_ID_REGEX)]['Tumor_Sample_Barcode']  # barcodes that do not match the primary-tumor pattern
    # TODO: Double check that the extra ['Tumor_Sample_Barcode'] serves no purpose
    df = df.drop(non_01_barcodes.index)
    print(df)
    shortened_patients = df['Tumor_Sample_Barcode'].str.extract(SHORTEN_PATIENT_REGEX, expand=False)
    df['identifier'] = shortened_patients
    gene_mutation_df = df.groupby(['Hugo_Symbol']).apply(mutations_for_gene)
    gene_mutation_df.index.set_names(['Hugo_Symbol', 'patient'], inplace=True)
    gene_mutation_df = gene_mutation_df.reset_index()
    gene_patient_mutations = gene_mutation_df.pivot(index='Hugo_Symbol', columns='patient', values='mutated')
    return gene_patient_mutations.transpose().fillna(0)
Any help would be greatly appreciated. (I know this wasn't too specific; I'm still trying to work out what this function does exactly and how I could make data to test it.)
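For what it's worth, the ValueError itself just means that the index being renamed has a single level while two names were supplied. A minimal reproduction, unrelated to this data:

import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1, 2, 3]})
flat = df.groupby('g').sum()  # the result has a single-level index, not a MultiIndex
flat.index.set_names(['g', 'x'], inplace=True)  # ValueError: Length of new names must be 1, got 2

With the four sample rows above, it is likely that every row is filtered out before the groupby (the 'Silent' row, the header-like 'Hugo_Symbol' row, and the barcodes that fail PRIMARY_TUMOR_PATIENT_ID_REGEX), and a groupby/apply over an empty frame yields a flat index instead of the expected two-level one. Printing gene_mutation_df.index.nlevels just before the set_names call is a quick way to confirm.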

Appending multiple pandas dataframe using for loop but returns an empty dataframe

I am trying to download a number of .csv files, which I convert to pandas DataFrames and append to each other.
The CSVs can be accessed via a URL that is created each day; using datetime the URLs can easily be generated and put in a list.
I am able to open each of these individually from the list.
But when I try to open a number of them and append them together, I get an empty DataFrame. The code looks like this:
#Imports
import datetime
import pandas as pd

#Testing that a single .csv file opens
data = pd.read_csv('https://promo.betfair.com/betfairsp/prices/dwbfpricesukwin01022018.csv')
data.iloc[:5]

#Taking the headings to use for the new dataframe
data_headings = list(data.columns.values)

#Setting up strings for the url
path_start = 'https://promo.betfair.com/betfairsp/prices/dwbfpricesukwin'
file = ".csv"

#Getting the dates which are used in the url
start = datetime.datetime.strptime("01-02-2018", "%d-%m-%Y")
end = datetime.datetime.strptime("04-02-2018", "%d-%m-%Y")
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days)]

#Creating the new dataframe which is appended to
for heading in data_headings:
    data = {heading: []}
df = pd.DataFrame(data, columns=data_headings)

#Creating the list of urls
date_list = []
for date in date_generated:
    date_string = date.strftime("%d%m%Y")
    x = path_start + date_string + file
    date_list.append(x)

#Opening and appending the csv files from the list of urls
for full_path in date_list:
    data_link = pd.read_csv(full_path)
    df.append(data_link)

print(df)
I have checked whether the files are just empty CSVs, and they are not. Any help would be appreciated.
Cheers,
Sandy
You are never storing the appended dataframe. The line:
df.append(data_link)
Should be
df = df.append(data_link)
However, this may be the wrong approach: you really want to read all of the URLs and concatenate the resulting DataFrames in one go. Check out this similar question and see if it can improve your code!
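A minimal sketch of that approach, assuming every URL in date_list resolves to a well-formed CSV with the same columns:

import pandas as pd

frames = [pd.read_csv(url) for url in date_list]  # one DataFrame per daily file
df = pd.concat(frames, ignore_index=True)         # combine them in a single call
print(df.shape)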
I really can't understand what you wanted to do here:
#Creating new dataframe which is appended to
for heading in data_headings:
    data = {heading: []}
df = pd.DataFrame(data, columns=data_headings)
By the way, if you stick with append, remember to store the result:
for full_path in date_list:
    data_link = pd.read_csv(full_path)
    df = df.append(data_link.copy())

How do I save my list to a dataframe keeping empty rows?

I'm trying to extract subject-verb-object triplets and then attach an ID. I am using a loop, so my list of extracted triplets keeps an empty result for the rows where no triplet was found. It looks like:
[]
[trump,carried,energy]
[]
[clinton,doesn't,trust]
When I print mylist it looks as expected.
However, when I try to create a dataframe from mylist I get an error caused by the empty rows:
`IndexError: list index out of range`
I tried to include an if statement to avoid this, but the problem is the same. I also tried using reindex instead, but df2 came out empty.
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import spacy
import textacy
import csv, string, re
import numpy as np
import pandas as pd

#Import csv file with pre-processing already carried out
df = pd.read_csv("pre-processed_file_1.csv", sep=",")

#Prepare dataframe to be relevant columns and unicode
df1 = df[['text_1', 'id']].copy()
import StringIO
s = StringIO.StringIO()
tweets = df1.to_csv(encoding='utf-8')

nlp = spacy.load('en')
count = 0
df2 = pd.DataFrame()
for row in df1.iterrows():
    doc = nlp(unicode(row))
    text_ext = textacy.extract.subject_verb_object_triples(doc)
    tweetID = df['id'].tolist()
    mylist = list(text_ext)
    count = count + 1
    if (mylist):
        df2 = df2.append(mylist, ignore_index=True)
    else:
        df2 = df2.append('0','0','0')
Any help would be much appreciated. Thank you!
You're supposed to pass a DataFrame-shaped object to append; passing the raw data doesn't work. So:
df2 = df2.append([['0', '0', '0']], ignore_index=True)
You can also wrap your processing in a function process_row, then do df2 = pd.DataFrame([process_row(row) for row in df1.iterrows()]). Note that while append won't work with empty rows, the DataFrame constructor just fills them in with None. If you want empty rows to be ['0', '0', '0'], you have several options:
- Have your processing function return ['0', '0', '0'] for empty rows
- Change the list comprehension to [process_row(row) if process_row(row) else ['0', '0', '0'] for row in df1.iterrows()]
- Do df2 = df2.fillna('0')
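A sketch of the first option, reusing the nlp and textacy objects from the question (the column names here are assumptions):

import pandas as pd

def process_row(row):
    # Return the first subject-verb-object triple, or a '0' placeholder when none is found
    doc = nlp(unicode(row))
    triples = list(textacy.extract.subject_verb_object_triples(doc))
    return [str(t) for t in triples[0]] if triples else ['0', '0', '0']

df2 = pd.DataFrame([process_row(row) for row in df1.iterrows()],
                   columns=['subject', 'verb', 'object'])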

Python for loop to read csv using pandas

I can combine 2 CSVs with the following script and it works well.
import pandas
csv1=pandas.read_csv('1.csv')
csv2=pandas.read_csv('2.csv')
merged=csv1.merge(csv2,on='field1')
merged.to_csv('output.csv',index=False)
Now, I would like to combine more than 2 CSVs using the same method as above.
I have a list of CSVs which I defined like this:
import pandas
collection=['1.csv','2.csv','3.csv','4.csv']
for i in collection:
    csv=pandas.read_csv(i)
    merged=csv.merge(??,on='field1')
merged.to_csv('output2.csv',index=False)
I haven't got it to work so far with more than one CSV. I guess it is just a matter of iterating inside the list... any ideas?
You need special handling for the first loop iteration:
import pandas
collection = ['1.csv', '2.csv', '3.csv', '4.csv']
result = None
for i in collection:
    csv = pandas.read_csv(i)
    if result is None:
        result = csv
    else:
        result = result.merge(csv, on='field1')
if result is not None:  # a bare "if result:" is ambiguous for a DataFrame and raises an error
    result.to_csv('output2.csv', index=False)
Another alternative would be to load the first CSV outside the loop, though this breaks when the collection is empty:
import pandas
collection = ['1.csv', '2.csv', '3.csv', '4.csv']
result = pandas.read_csv(collection[0])
for i in collection[1:]:
    csv = pandas.read_csv(i)
    result = result.merge(csv, on='field1')
result.to_csv('output2.csv', index=False)
Seeding the loop with an empty DataFrame (pandas.DataFrame()) would not work here, because the empty frame has no 'field1' column to merge on, so one of the two approaches above is needed.
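A compact alternative that avoids the first-iteration special case is functools.reduce over the list of frames (a sketch, assuming every file shares the field1 column):

import pandas
from functools import reduce

collection = ['1.csv', '2.csv', '3.csv', '4.csv']
frames = [pandas.read_csv(name) for name in collection]

# Fold the list into a single frame with successive merges on 'field1'
result = reduce(lambda left, right: left.merge(right, on='field1'), frames)
result.to_csv('output2.csv', index=False)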

Using Pandas dataframe with FOR loops

and thank you for looking.
I am trying my hand at modifying a Python script to download a bunch of data from a website. Given the large amount of data involved, I want to convert the script to pandas. I have this code so far.
snames = ['Index','Node_ID','Node','Id','Name','Tag','Datatype','Engine']
sensorinfo = pd.read_csv(sensorpath, header=None, names=snames, index_col=['Node', 'Index'])

for j in sensorinfo['Node']:
    for z in sensorinfo['Index']:
        # create a string for the url of the data
        data_url = "http://www.mywebsite.com/emoncms/feed/data.json?id=" + sensorinfo['Id'] + "&apikey1f8&start=&end=&dp=600"
        print data_url
        # read in the data from emoncms
        sock = urllib.urlopen(data_url)
        data_str = sock.read()
        sock.close
        # data is output as a string so we convert it to a list of lists
        data_list = eval(data_str)
        myfile = open(feed_list['Name'[k]] + ".csv", 'wb')
        wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
The first part of the code gives me a very nice table, which means I am opening my csv file and importing the information correctly. My question is this: in pseudocode, I am trying to do the following:
for node in nodes:  # 4 nodes so far
    for index in indexes:
        data_url = websiteinfo + Id + sampleinformation
        smalldata.read.csv(data_url)
        merge(bigdata, smalldata.no_time_column)
This is my first post here, I tried to keep it short but still supply the relevant data. Let me know if I need to clarify anything.
In your pseudocode, you can do this:
dfs = []
for node in nodes:  # 4 nodes so far
    for index in indexes:
        data_url = websiteinfo + Id + sampleinformation
        df = smalldata.read.csv(data_url)
        dfs.append(df)
df = pd.concat(dfs)
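Translated into runnable pandas, a sketch might look like the following; the base_url/suffix strings are reconstructed from the question, and using pd.read_json is an assumption about the shape of the feed:

import pandas as pd

base_url = "http://www.mywebsite.com/emoncms/feed/data.json?id="
suffix = "&apikey=1f8&start=&end=&dp=600"

dfs = []
for sensor_id in sensorinfo['Id']:
    # each feed returns JSON that pandas can read straight from the URL
    df = pd.read_json(base_url + str(sensor_id) + suffix)
    df['sensor_id'] = sensor_id  # record which feed each row came from
    dfs.append(df)

bigdata = pd.concat(dfs, ignore_index=True)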
