Python for loop to read csv using pandas - python

I can combine 2 CSV files and it works well:
import pandas
csv1 = pandas.read_csv('1.csv')
csv2 = pandas.read_csv('2.csv')
merged = csv1.merge(csv2, on='field1')
merged.to_csv('output.csv', index=False)
Now, I would like to combine more than 2 CSVs using the same method as above.
I have a list of CSV files, defined like this:
import pandas
collection=['1.csv','2.csv','3.csv','4.csv']
for i in collection:
    csv = pandas.read_csv(i)
    merged = csv.merge(??, on='field1')
merged.to_csv('output2.csv', index=False)
I haven't got it to work so far with more than 2 CSVs. I guess it's just a matter of iterating over the list. Any ideas?

You need special handling for the first loop iteration:
import pandas
collection = ['1.csv', '2.csv', '3.csv', '4.csv']
result = None
for i in collection:
    csv = pandas.read_csv(i)
    if result is None:
        result = csv
    else:
        result = result.merge(csv, on='field1')
# note: "if result:" would raise ValueError for a DataFrame, so test against None
if result is not None:
    result.to_csv('output2.csv', index=False)
Another alternative is to load the first CSV outside the loop, but this raises an IndexError when the collection is empty:
import pandas
collection = ['1.csv', '2.csv', '3.csv', '4.csv']
result = pandas.read_csv(collection[0])
for i in collection[1:]:
    csv = pandas.read_csv(i)
    result = result.merge(csv, on='field1')
result.to_csv('output2.csv', index=False)
An empty DataFrame can be created with pandas.DataFrame(), but this third variant does not actually work: merging into an empty frame on 'field1' fails because the empty frame has no 'field1' column, so one of the two approaches above is needed.
import pandas
collection = ['1.csv', '2.csv', '3.csv', '4.csv']
result = pandas.DataFrame()  # empty frame, no columns
for i in collection:
    csv = pandas.read_csv(i)
    result = result.merge(csv, on='field1')  # fails on the first iteration
result.to_csv('output2.csv', index=False)
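A more compact variant, as a sketch assuming every file shares the 'field1' column, is to fold the pairwise merge over the whole list with functools.reduce (shown here with small in-memory frames standing in for the CSV files):

```python
import functools
import pandas as pd

# Stand-ins for the CSV files; in practice each would be pd.read_csv(name)
frames = [
    pd.DataFrame({'field1': [1, 2], 'a': ['x', 'y']}),
    pd.DataFrame({'field1': [1, 2], 'b': ['p', 'q']}),
    pd.DataFrame({'field1': [1, 2], 'c': ['m', 'n']}),
]

# Fold the pairwise merge left-to-right over the whole list
merged = functools.reduce(
    lambda left, right: left.merge(right, on='field1'),
    frames,
)
print(list(merged.columns))  # ['field1', 'a', 'b', 'c']
```

This avoids the special-casing of the first iteration entirely; reduce uses the first frame as the starting value.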

Related

Copy the first column of a .csv file into a new file

I know this is a very easy task, but I'm stuck right now and can't get it solved. I need to copy the first column of a .csv file, including the header, into a newly created file. My code:
station = 'SD_01'
import csv
import pandas as pd
df = pd.read_csv(str(station) + "_ED.csv", delimiter=';')
list1 = []
matrix1 = df[df.columns[0]].as_matrix()
list1 = matrix1.tolist()
with open('{0}_RRS.csv'.format(station), "r+") as f:
    writer = csv.writer(f)
    writer.writerows(map(lambda x: [x], list1))
As a result, my file has an empty line between the values, has no header (I could continue without the header, though), and something at the bottom which I cannot identify.
>350
>
>351
>
>352
>
>...
>
>949
>
>950
>
>Ž‘’“”•–—˜™š›œžŸ ¡¢
Just a short impression of the 1200+ lines.
I am pretty sure that this is a very clunky way to do it; easier ways are always welcome.
How do I get rid of all the empty lines and the crazy stuff at the end?
When you get a column from a dataframe, it's returned as type Series, and Series has a built-in to_csv method you can use, so you don't need to do any matrix casting. (Incidentally, the empty lines come from csv.writer writing to a file opened without newline='', and the strange characters at the end are leftover bytes from the original file: mode "r+" overwrites from the start but never truncates.)
import pandas as pd
df = pd.read_csv('name.csv',delimiter=';')
first_column = df[df.columns[0]]
first_column.to_csv('new_file.csv', index=False, header=True)
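For example, a minimal self-contained sketch with hypothetical data standing in for the station file; passing header=True keeps the column name that was wanted:

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv('SD_01_ED.csv', delimiter=';')
df = pd.DataFrame({'wavelength': [350, 351, 352], 'value': [0.1, 0.2, 0.3]})

first_column = df[df.columns[0]]  # a Series, no matrix casting needed
first_column.to_csv('new_file.csv', index=False, header=True)

# Read it back: the header and values round-trip with no blank lines
print(pd.read_csv('new_file.csv')['wavelength'].tolist())  # [350, 351, 352]
```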

appending multiple pandas DataFrames read in from files

Hello, I am trying to read in multiple files, create a DataFrame of the specific key information I need for each file, and then append each of those DataFrames to a main DataFrame called topics. I have tried the following code.
import pandas as pd
import numpy as np
from lxml import etree
import os

topics = pd.DataFrame()
for filename in os.listdir('./topics'):
    if not filename.startswith('.'):
        #print(filename)
        tree = etree.parse('./topics/' + filename)
        root = tree.getroot()
        childA = []
        elementT = []
        ElementA = []
        for child in root:
            elementT.append(str(child.tag))
            ElementA.append(str(child.attrib))
            childA.append(str(child.attrib))
            for element in child:
                elementT.append(str(element.tag))
                #childA.append(child.attrib)
                ElementA.append(str(element.attrib))
                childA.append(str(child.attrib))
                for sub in element:
                    #print('***', child.attrib, ':', element.tag, ':', element.attrib, '***')
                    #childA.append(child.attrib)
                    elementT.append(str(sub.tag))
                    ElementA.append(str(sub.attrib))
                    childA.append(str(child.attrib))
        df = pd.DataFrame()
        df['c'] = np.array(childA)
        df['t'] = np.array(ElementA)
        df['a'] = np.array(elementT)
        file = df['t'].str.extract(r'([A-Z][A-Z].*[words.xml])#')
        start = df['t'].str.extract(r'words([0-9]+)')
        stop = df['t'].str.extract(r'.*words([0-9]+)')
        tags = df['a'].str.extract(r'.*([topic]|[pointer]|[child])')
        rootTopic = df['c'].str.extract(r'rdhillon.(\d+)')
        df['f'] = file
        df['start'] = start
        df['stop'] = stop
        df['tags'] = tags
        # c = topic
        # r = pointer
        # d = child
        df['topicID'] = rootTopic
        df = df.iloc[:, 3:]
        topics.append(df)
However, when I call topics I get the following (empty) output:
topics
Out[19]:
Can someone please let me know where I am going wrong? Also, any suggestions for improving my messy code would be appreciated.
Unlike list.append, DataFrame.append returns a new object. So topics.append(df) returns a DataFrame that you never store anywhere, and topics remains the empty DataFrame you declared on the 6th line. You can fix this with
topics = topics.append(df)
However, appending to a DataFrame within a loop is a very costly exercise. Instead you should append each DataFrame to a list within the loop and call pd.concat() on the list of DataFrames after the loop.
import pandas as pd
topics_list = []
for filename in os.listdir('./topics'):
    # All of your code
    topics_list.append(df)  # Lists are modified in place by append
# After the loop, one call to concat
topics = pd.concat(topics_list)
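A minimal, self-contained sketch of this pattern, with small hypothetical frames standing in for the per-file DataFrames:

```python
import pandas as pd

# Build one small DataFrame per "file" inside the loop
frames = []
for i in range(3):
    df = pd.DataFrame({'c': [i], 't': ['t%d' % i], 'a': ['a%d' % i]})
    frames.append(df)  # cheap in-place list append

# One concat after the loop instead of repeated DataFrame.append
topics = pd.concat(frames, ignore_index=True)
print(topics.shape)  # (3, 3)
```

Each DataFrame.append inside a loop copies all rows accumulated so far, so the loop is quadratic in the total number of rows; the list-then-concat version copies each row only once.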

How to store output of condition in a dataframe

I am trying to store the output of an if condition in a DataFrame. Given below is what I am trying:
import os
filename = "Desktop/sales/10-05-2018"
# check whether the file exists
if os.path.exists(filename):
    print("Files received")
else:
    print("No files received")
Instead of printing the output, I would like to store it in a DataFrame. Could anyone advise? Thanks.
This is one way you can store such a mapping in a dataframe.
import os, pandas as pd
df = pd.DataFrame()
df['filename'] = ['file1.csv', 'file2.csv', 'file3.csv']
df['exists'] = df['filename'].map(os.path.exists)
This will create a dataframe of filenames in one column and a Boolean series in another indicating whether or not the file exists.
If the filenames are retrieved from an iterable, you should aggregate to a list of lists first before constructing a dataframe. Appending continually to an existing dataframe is inefficient in this situation.
lst = ( ... some iterable ... )
lst_of_lst = [[f, os.path.exists(f)] for f in lst]
df = pd.DataFrame(lst_of_lst, columns=['filename', 'exists'])
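A runnable sketch of the same idea, using a temporary file so that both outcomes appear (the second filename is hypothetical):

```python
import os
import tempfile
import pandas as pd

# Create one real file so the 'exists' column shows both True and False
tmp = tempfile.NamedTemporaryFile(suffix='.csv', delete=False)
tmp.close()

names = [tmp.name, 'no_such_file.csv']
lst_of_lst = [[f, os.path.exists(f)] for f in names]
df = pd.DataFrame(lst_of_lst, columns=['filename', 'exists'])

print(df['exists'].tolist())  # [True, False]
os.remove(tmp.name)  # clean up the temporary file
```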

How do I save my list to a dataframe keeping empty rows?

I'm trying to extract subject-verb-object triplets and then attach an ID. I am using a loop, so my list of extracted triplets keeps empty results for the rows where no triplet was found. So it looks like:
[]
[trump,carried,energy]
[]
[clinton,doesn't,trust]
When I print mylist it looks as expected.
However, when I try to create a dataframe from mylist I get an error caused by the empty rows:
`IndexError: list index out of range`.
I tried to include an if statement to avoid this, but the problem is the same. I also tried using reindex instead, but df2 came out empty.
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import spacy
import textacy
import csv, string, re
import numpy as np
import pandas as pd

# Import csv file with pre-processing already carried out
df = pd.read_csv("pre-processed_file_1.csv", sep=",")
# Prepare dataframe to be relevant columns and unicode
df1 = df[['text_1', 'id']].copy()

import StringIO
s = StringIO.StringIO()
tweets = df1.to_csv(encoding='utf-8')

nlp = spacy.load('en')
count = 0
df2 = pd.DataFrame()
for row in df1.iterrows():
    doc = nlp(unicode(row))
    text_ext = textacy.extract.subject_verb_object_triples(doc)
    tweetID = df['id'].tolist()
    mylist = list(text_ext)
    count = count + 1
    if mylist:
        df2 = df2.append(mylist, ignore_index=True)
    else:
        df2 = df2.append('0', '0', '0')
Any help would be much appreciated. Thank you!
You're supposed to pass a DataFrame-shaped object to append; passing the raw data doesn't work. So:
df2 = df2.append([['0', '0', '0']], ignore_index=True)
You can also wrap your processing in a function process_row, then do df2 = pd.DataFrame([process_row(row) for row in df1.iterrows()]). Note that while append won't work with empty rows, the DataFrame constructor just fills them in with None. If you want empty rows to be ['0','0','0'], you have several options:
- Have your processing function return ['0','0','0'] for empty rows
- Change the list comprehension to [process_row(row) if process_row(row) else ['0','0','0'] for row in df1.iterrows()]
- Do df2 = df2.fillna('0')
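A sketch of the constructor-based approach with the '0' fill, using the example triplets above as hypothetical data in place of the spacy/textacy extraction:

```python
import pandas as pd

# Hypothetical extraction results: empty lists where no triplet was found
rows = [[], ['trump', 'carried', 'energy'], [], ['clinton', "doesn't", 'trust']]

# Replace empty extractions before building the frame
rows = [r if r else ['0', '0', '0'] for r in rows]

df2 = pd.DataFrame(rows, columns=['subject', 'verb', 'object'])
print(df2['subject'].tolist())  # ['0', 'trump', '0', 'clinton']
```

Building the DataFrame once from a complete list of rows also sidesteps the costly append-in-a-loop pattern entirely.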

Python pandas - Filter a data frame based on a pre-defined array

I'm trying to filter a data frame based on the contents of a pre-defined array.
I've looked up several examples on StackOverflow but simply get an empty output.
I'm not able to figure out what I'm doing incorrectly. Could I please get some guidance here?
import pandas as pd
import numpy as np
csv_path = 'history.csv'
df = pd.read_csv(csv_path)
pre_defined_arr = ["A/B", "C/D", "E/F", "U/Y", "R/E", "D/F"]
distinct_count_column_headers = ['Entity']
distinct_elements= pd.DataFrame(df.drop_duplicates().Entity.value_counts(),columns=distinct_count_column_headers)
filtered_data= distinct_elements[distinct_elements['Entity'].isin(pre_defined_arr)]
print("Filtered data ... ")
print(filtered_data)
OUTPUT
Filtered data ...
Empty DataFrame
Columns: [Entity]
Index: []
Managed to do that using the filter function: .filter(items=pre_defined_arr)
import pandas as pd
import numpy as np
csv_path = 'history.csv'
df = pd.read_csv(csv_path)
pre_defined_arr = ["A/B", "C/D", "E/F", "U/Y", "R/E", "D/F"]
distinct_count_column_headers = ['Entity']
distinct_elements_filtered= pd.DataFrame(df.drop_duplicates().Entity.value_counts().filter(items=pre_defined_arr),columns=distinct_count_column_headers)
It's strange that there's just one answer I came across that suggests the filter function. Almost 9 out of 10 talk about the .isin function, which didn't work in my case.
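The likely reason .isin appeared to fail: value_counts() puts the entity names in the index and the counts in the values, so .isin must be applied to the index rather than to the column of counts. A small sketch with hypothetical data standing in for history.csv:

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv('history.csv')
df = pd.DataFrame({'Entity': ['A/B', 'A/B', 'C/D', 'X/Z']})
pre_defined_arr = ['A/B', 'C/D', 'E/F']

counts = df['Entity'].value_counts()  # entity names live in the *index*

by_filter = counts.filter(items=pre_defined_arr)        # filters by index labels
by_isin = counts[counts.index.isin(pre_defined_arr)]    # isin on the index works too

print(sorted(by_filter.index))  # ['A/B', 'C/D']
```

Both give the same result; .isin only fails when applied to the counts column, where it compares integer counts against the strings in the array.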
