I have a csv file with 330k+ rows and 12 columns. I need to put column 1 (numeric ID) and column 3 (text string) into a list or array so I can analyze the data in column 3.
This code worked for me to pull out the third column:
for row in csv_strings:
    string1.append(row[2])
Can someone point me to the correct class of commands that I can research to get the job done?
Thanks.
Pandas is the best tool for this.
import pandas as pd

df = pd.read_csv("filename.csv", usecols=[0, 2])

points = []
for row in df.itertuples(index=False):
    points.append({"id": row[0], "text": row[1]})
You can pull them out into a list of key-value pairs.
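If you stay in pandas, the same list of key-value pairs can come straight from the DataFrame; a minimal sketch, assuming the file has no header row (drop header=None if it does):

import pandas as pd

# Read only the ID and text columns (positions 0 and 2)
df = pd.read_csv("filename.csv", usecols=[0, 2], header=None)
df.columns = ["id", "text"]

# One dict per row, e.g. [{'id': 1, 'text': '...'}, ...]
points = df.to_dict("records")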
A different answer, using tuples, which ensure immutability and are pretty fast, but less convenient than dictionaries:
# Build the results
results = []
for row in csv_lines:
    results.append((row[0], row[2]))

# Read the results
for result in results:
    result[0]  # id
    result[1]  # string
import csv

x, z = [], []
with open('Data.csv') as f:
    csv_reader = csv.reader(f)
    for line in csv_reader:
        x.append(line[0])
        z.append(line[2])
This can help you get the data from the first and third columns.
I guess professional data analysts know an answer to this, but I'm no analyst.
And I just barely know Pandas. So I am at a loss.
There are two lists. Their contents are unpredictable (parsed from web counters, web analytics, web statistics, etc).
list1 = ['WordA', 'WordB', ..., 'WordXYZ']
...and...
list2 = [['WordA1', 'WordA2'], ['WordB1'], ['WordC1', 'WordC2', ..., 'WordC96'], ..., ['WordXYZ1', 'WordXYZ2']]
The lengths of the two lists are always equal (they're the results of a parser I already wrote).
What I need is to create a dataframe with two rows for each item in list1, each with the word in the first column, and the corresponding words from list2 placed in the first of those two rows (starting from the second column; the first column is already filled from list1).
So I imagine the following steps:
Create a dataframe filled with empty strings ('') with the number of columns equal to len(max(list2, key=len)) and the number of rows equal to twice the length of list1 (aaaand I don't know how; this is actually only my second time using Pandas at all!);
Somehow fill the first column of the resulting dataframe with the contents of list1, filling two rows for each item in list1;
Somehow put the contents of list2 into every even row of the dataframe, starting with the second column;
Save into an .xls file (yes, that's the final goal) and enjoy a job done.
Now, first thing: I already spent half a day trying to find an answer to "how to create a pandas dataframe filled with empty strings with a given number of rows and columns", and found a lot of different articles which contradict each other.
And second, there's got to be a more Pythonic, more efficient, and more stylish way to do all this!
Aaaand, maybe there's a way to create an Excel file without using pandas at all, which I just don't know about (hopefully, yet).
Can anyone help, please?
UPD: (to answer a question) the results should look like:
WordA WordA1 WordA2
WordA
WordB WordB1
WordB
WordC WordC1 WordC2 (...) WordC96
WordC
(...)x2
WordXYZ WordXYZ1 WordXYZ2
WordXYZ
If you just want to write the lists to an Excel file, you don't need pandas. You can use, for instance, openpyxl:
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
for *word, words in zip(list1, list2):
    # word is a one-element list [WordX]; words is the matching sublist
    ws.append(word + words)  # first row: WordX, WordX1, WordX2, ...
    ws.append(word)          # second row: WordX alone
wb.save('output.xlsx')
If you really want to use pandas:
import pandas as pd

# Interleave each sublist (shifted one column to the right) with its word
df = pd.DataFrame([[None] + x if isinstance(x, list) else [x]
                   for pair in zip(list2, list1) for x in pair])
# Back-fill column 0 so each sublist row picks up the word from the row below it
df[0] = df[0].bfill()
df.to_excel('output.xlsx', index=False, header=False)
The following should give you (almost) what you want:
import pandas as pd
from itertools import chain

list1 = ['WordA', 'WordB']
list2 = [['WordA1', 'WordA2'], ['WordB1']]

# Flatten list2
list2 = list(chain(*list2))

# Create DataFrames
list1 = pd.DataFrame(data=list1, columns=["word1"])
list2 = pd.DataFrame(data=list2, columns=["word2"])

# Prefix for list2: strip the trailing digits to recover the parent word
list2["prefix"] = list2["word2"].str.extract("([^0-9]+)", expand=False)
list1 = list1.merge(list2, left_on="word1", right_on="prefix", how="inner")

# Concatenated words
list1 = list1.groupby("word1")["word2"].agg(lambda x: " ".join(x)).reset_index()
list1["word2"] = list1["word1"].str.cat(list1["word2"], sep=" ")
list1 = pd.melt(list1).sort_values(by="value")
I am trying to process a CSV file into a new CSV file with only the columns of interest, removing rows with unfit values of -1. Unfortunately I get unexpected results: the script automatically includes column 0 (the old ID) in the new CSV file without being asked to (it is not listed in cols = [..]).
How could I change these values to a new row count? For example, when we remove row 9 with id=9, the dataset ids currently go [..7, 8, 10...] instead of being renumbered as [..7, 8, 9, 10...]. I hope someone has a solution for this.
import pandas as pd

# Take only specific columns from the dataset
cols = [1, 5, 6]
data = pd.read_csv('data_sample.csv', usecols=cols, header=None)
data.columns = ["url", "gender", "age"]

# Remove rows from the dataset with undefined values of -1
data = data[data['gender'] != -1]
data = data[data['age'] != -1]

""" Additional working solution
indexGender = data[data['gender'] == -1].index
indexAge = data[data['age'] == -1].index
# Delete these row indexes from the dataFrame
data.drop(indexGender, inplace=True)
data.drop(indexAge, inplace=True)
"""

data.to_csv('data_test.csv')
Thank you in advance.
I solved the problem with a simple line after the data drop:
data.reset_index(drop=True, inplace=True)
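Putting it together, a minimal sketch of the whole pipeline (file names and column positions as in the question):

import pandas as pd

# Keep only the columns of interest
data = pd.read_csv('data_sample.csv', usecols=[1, 5, 6], header=None)
data.columns = ["url", "gender", "age"]

# Drop rows with undefined values of -1
data = data[(data['gender'] != -1) & (data['age'] != -1)]

# Renumber the remaining rows 0, 1, 2, ... so the written ids are contiguous
data.reset_index(drop=True, inplace=True)

data.to_csv('data_test.csv')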
I have a csv file which I am splitting on the delimiter ','. My goal is to iterate through the first column of the entire file, and whenever it matches the word I have, put the subsequent values of that particular row into different lists.
Example:
AAA,man,2300,
AAA,woman,3300,
BBB,man,2300,
BBB,man,3300,
BBB,man,2300,
BBB,woman,3300,
CCC,woman,2300,
CCC,man,3300,
DDD,man,2300,
My code:
import csv

datafile = "test.txt"
with open('C:/Users/Dev/Desktop/TEST/Details/' + datafile, 'r') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        print(row)
If I search for a value BBB, I want to have the rest of the details of the rows into 3 different lists. (CSV file will always have only 4 columns; the fourth column might be empty sometimes, so we just leave it with a comma)
Sample:
list1 = ['man', 'man', 'man', 'woman']
list2 = [2300, 3300, 2300, 3300]
list3 = ['', '', '', '']
How can I do this?
Try it with pandas:
import pandas as pd

df = pd.read_csv('path/to/file', sep=',', header=None)
Now just use:
list1, list2, list3 = df[df[0] == "BBB"].T.values.tolist()
(One unpacking target per column: with the four-column sample above, you would unpack four targets, the first holding the matched keys.)
Example df:
df = pd.DataFrame(dict(col1=["AAA", "BBB", "BBB"],
                       col2=[1, 2, 3],
                       col3=[4, 5, 6]))
Outputs:
(['BBB', 'BBB'], [2, 3], [5, 6])  # list1, list2, list3
The answer to your question is in your own statement: "If I search for a value, say BBB, I want to have the rest of the details of the rows in 3 different lists."
Create the empty lists:
list1 = []
list2 = []
list3 = []
Append values to those lists:
for row in reader:
    if row[0] == "BBB":
        list1.append(row[1])
        list2.append(row[2])
        list3.append(row[3])
You can initialize three empty list variables and then, in the loop over the rows, if column 1 matches your value, append the subsequent columns to the lists.
Edit: or use pandas, as Anton VBR has answered.
I'll ignore the part where you read the data from the csv file.
Let us begin with a list (a 2-D array). Construct a for loop that searches only the first column for your condition; for your example this gives a result vector = [1, 2, 7, 8, 9], a list of the indices of the rows meeting your condition.
Now, to get the "filtered" lists, make another for loop that extracts the other columns of the rows whose indices are in result_vector.
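A minimal sketch of that idea (variable names are mine; the search key "BBB" and the trailing comma in each line follow the question):

# Parse the file into a 2-D list of rows
with open('test.txt') as f:
    rows = [line.rstrip('\n').split(',') for line in f]

# Collect the indices of the rows whose first column matches the search key
result_vector = [i for i, row in enumerate(rows) if row[0] == 'BBB']

# Extract the remaining columns at those indices into separate lists
list1 = [rows[i][1] for i in result_vector]  # ['man', 'man', 'man', 'woman']
list2 = [rows[i][2] for i in result_vector]  # ['2300', '3300', '2300', '3300']
list3 = [rows[i][3] for i in result_vector]  # ['', '', '', '']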
I am having trouble writing a loop that returns what I need. I have two CSV files. For the values in a column in CSV 1, I need to find if there are matching values in CSV 2 and if there are matching values, return a dataframe for the row of the matching values. When I try to create a loop, I cannot get the right values in the loop. For example:
import pandas as pd

csv2 = pd.read_csv('/users/jamesh/documents/asiopods/asicrawlconcat.csv', header=1)
csv1 = pd.read_csv('/users/jamesh/documents/asiopods/asiconcat.csv', header=0)

h1s = csv1['Recommended_H1']
h1 = h1s
h1[0:3]  # test

subject = csv2['H1_1']

for x in h1:
    for y in subject:
        if x == y:
            print(y)
The code above returns the values I need, but as plain strings. I need to return the dataframe rows of CSV 2 that correspond to the values of y.
Any help or direction is greatly appreciated!
Edit: with some offline help, I have been able to get the correct information from the loop. However, I still can't figure out how to get the data into a pandas DataFrame; instead, the data is returned in a vertical manner. Here is the new loop:
def foogaiz():
    for k1, v1 in h1.items():
        for k2, v2 in subject.items():
            if v1 == v2:
                data = csv2.iloc[k2]  # row of csv2 at position k2
                return data
It's a little unclear whether the values you're matching on ("Recommended_H1" in your example) are unique and appear only once in asiconcat.csv. If so, then I recommend naming the two columns that hold the matching values the same ('H1_1' in my example syntax below) and doing a df.merge():
matched_df = df.merge(crawldf,on="H1_1",how="left")
The left-join option keeps the rows that don't have matches in crawldf.
You can read the documentation for merge here:
http://pandas.pydata.org/pandas-docs/stable/merging.html
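A minimal sketch of that suggestion (paths shortened; csv2 plays the role of crawldf):

import pandas as pd

csv2 = pd.read_csv('asicrawlconcat.csv', header=1)
csv1 = pd.read_csv('asiconcat.csv', header=0)

# Give both frames the same key column name, then left-join on it
df = csv1.rename(columns={'Recommended_H1': 'H1_1'})
matched_df = df.merge(csv2, on="H1_1", how="left")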
What is the best approach for importing a CSV that has a different number of columns for each row into a Pandas DataFrame, using Pandas or the CSV module?
"H","BBB","D","Ajxxx Dxxxs"
"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"
Using this code:
import pandas as pd

data = pd.read_csv("smallsample.txt", header=None)
the following error is generated
Error tokenizing data. C error: Expected 4 fields in line 2, saw 8
Supplying a list of column names in read_csv() should do the trick.
ex: names=['a', 'b', 'c', 'd', 'e']
https://github.com/pydata/pandas/issues/2981
Edit: if you don't want to supply column names then do what Nicholas suggested
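For the two sample lines above, where the widest row has 8 fields, a minimal call might look like this (the names themselves are arbitrary):

import pandas as pd

# Eight names cover the widest line; shorter lines get NaN in the extra columns
names = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
data = pd.read_csv("smallsample.txt", header=None, names=names)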
You can dynamically generate column names as simple counters (0, 1, 2, etc).
Dynamically generate column names
import pandas as pd

# Input
data_file = "smallsample.txt"

# Delimiter
data_file_delimiter = ','

# The max column count a line in the file could have
largest_column_count = 0

# Loop over the data lines
with open(data_file, 'r') as temp_f:
    # Read the lines
    lines = temp_f.readlines()
    for l in lines:
        # Count the columns in the current line
        column_count = len(l.split(data_file_delimiter))
        # Keep the largest column count seen so far
        largest_column_count = max(largest_column_count, column_count)

# Generate column names (will be 0, 1, 2, ..., largest_column_count - 1)
column_names = [i for i in range(0, largest_column_count)]

# Read the csv
df = pd.read_csv(data_file, header=None, delimiter=data_file_delimiter, names=column_names)
# print(df)
Missing values will be assigned to the columns which your CSV lines don't have a value for.
A polished version of P.S.'s answer follows; it works. Remember that we have inserted a lot of missing values into the dataframe.
import pandas as pd

### Loop over the data lines
with open("smallsample.txt", 'r') as temp_f:
    # Get the number of columns in each line
    col_count = [len(l.split(",")) for l in temp_f.readlines()]

### Generate column names (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = [i for i in range(0, max(col_count))]

### Read the csv
df = pd.read_csv("smallsample.txt", header=None, delimiter=",", names=column_names)
If you want something really concise without explicitly giving column names, you could do this:
Make a one-column DataFrame with each row being a line of the .csv file;
Split each row on commas and expand the DataFrame.
df = pd.read_fwf('<filename>.csv', header=None)
df = df[0].str.split(',', expand=True)
Error tokenizing data. C error: Expected 4 fields in line 2, saw 8
The error gives a clue to solving the problem: "Expected 4 fields in line 2, saw 8" means the second row has 8 fields while the first row has 4.
import pandas as pd
# Inside range(), set the maximum value you can see in
# "Expected 4 fields in line 2, saw 8" - here it will be 8
data = pd.read_csv("smallsample.txt", header=None, names=range(8))
Use range instead of manually setting names, as that becomes cumbersome when you have many columns.
You can use shantanu pathak's method above to find the longest row length in your data.
Additionally, you can fill the NaN values with 0 if you need uniform data lengths, e.g. for clustering (k-means):
new_data = data.fillna(0)
We could even use the pd.read_table() method to read the csv file, which loads it as a single-column DataFrame that can then be split on ','.
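A sketch of that idea, assuming the file contains no tab characters (read_table's default separator); note that a plain comma split would also break commas inside quoted fields:

import pandas as pd

# The default tab separator leaves each (tab-free) line in a single column
df = pd.read_table("smallsample.txt", header=None)

# Split the lone column on commas into separate columns
df = df[0].str.split(',', expand=True)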
Manipulate your csv so that the first row is the one with the most elements and all subsequent rows have fewer. Pandas will create as many columns as the first row has.
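A tiny sketch of that approach (the reordered file name is hypothetical):

import pandas as pd

# With the widest (8-field) line first, pandas allocates 8 columns up front;
# shorter lines are padded with NaN
data = pd.read_csv("smallsample_reordered.txt", header=None)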