Merge text file with a csv database file using pandas - python

[Updated question]
I have a text file that looks like this:
#File_information1
#File_information2
A B C D
1 2 3 4.2
5 6 7 8.5 # example.txt is tab-separated ('\t'); column A has dtype object
I'd like to merge the text file with a csv database file based on column E, which contains integers.
E,name,age
1,john,23
5,mary,24 # in database.csv, column E has dtype int64
So I read the text file, skipping the first two unneeded header lines:
example = pd.read_csv('example.txt', header=2, sep='\t')
database = pd.read_csv('database.csv')
request = example.rename(columns={'A': 'E'})
New_data = request.merge(database, on='E', how='left')
But the result is not what I want: it shows NaN in the name and age columns.
I think the int64 vs. object dtype mismatch is the problem. Does anyone know how to work this out?
E,B,C,D,name,age
1,2,3,4.2,NaN,NaN
5,6,7,8.5,NaN,NaN

You just need to edit this in your code. Instead of:
example = pd.read_csv('example.txt', header=2, sep='\t')
use this:
example = pd.read_csv('example.txt', header=2, sep=' ', index_col=False)

Actually I tried reading your files with:
example = pd.read_csv('example.txt', header=2, sep='\t')
# Renaming
example.columns = ['E', 'B', 'C', 'D']
database = pd.read_csv('database.csv')
New_data = example.merge(database, on='E', how='left')
And this returns:
E B C D name age
0 1 2 3 4.2 john 23
1 5 6 7 8.5 mary 24
EDIT: it is actually not clear what the separator of the original example.txt file is. If it is whitespace, try sep=r'\s+' instead of sep=' '.
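If the separator is correct and the name/age columns are still NaN, the dtype mismatch the question suspects is the other usual culprit: merge keys must have the same dtype on both sides. A minimal sketch of that fix (assuming the key column really contains only digits):
import pandas as pd

example = pd.read_csv('example.txt', header=2, sep='\t')
example = example.rename(columns={'A': 'E'})
# cast the object-dtype key to int64 so it matches the key in database.csv
example['E'] = example['E'].astype('int64')
database = pd.read_csv('database.csv')
New_data = example.merge(database, on='E', how='left')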

How do I create a DataFrame from a text file read line by line?

I have a text file that needs to be read line by line and converted into a data frame with the following 4 columns: movie_id, customerid, rating, and date.
import re
import pandas as pd

with open('/Users/Desktop/Final Semester Fall 2022/archive/combined_data_1.txt', encoding='latin-1') as f:
    for line in f:
        result = re.search(r"^(\d+),(\d+),(\d{4}-\d{2}-\d{2})/gm", line)
        if re.search(r"(^\d+):", line) is not None:
            movie_id = re.search(r"(^\d+):", line).group(1)
        elif result:
            customerid = result.group(1)
            rating = result.group(2)
            date = result.group(3)
        else:
            continue

data_list = [customerid, rating, date, movie_id]
df1 = pd.DataFrame(data_list)
df1.to_csv(r'/Users/Desktop/Final Semester Fall 2022/archive/combineddata1.csv')
I'm getting a NameError when I run this.
How do I fix this error?
Thanks in advance!
Here is one way to do it:
import pandas as pd

# read the file using read_csv with ":" as the separator;
# since there is only one colon ":" per movie id line, you end up with one row per movie
# followed by rows for the rest of the data
df = pd.read_csv(r'c:\csv.csv', sep=':', header=None, names=['col1', 'col2'])

# when a row contains no comma, it holds only a movie id,
# so we populate the MovieId column and fill it down over the following rows
df['MovieId'] = df['col1'].mask(df['col1'].str.contains(',')).ffill()

# split the data into CustomerID, Rating and Date
df[['CustomerID', 'Rating', 'Date']] = df['col1'].str.split(',', expand=True)

# drop the unwanted columns and rows
df2 = df[df['col1'].ne(df['MovieId'])].drop(columns=['col1', 'col2'])
df2
# sample output, created from the data shared above as an image
MovieId CustomerID Rating Date
1 1 1488844 3 2005-09-06
2 1 822109 5 2005-05-13
3 1 885013 4 2005-10-19
4 1 30878 4 2005-12-26
5 1 823519 3 2004-05-03
6 1 893988 3 2005-11-17
7 1 124105 4 2004-08-05
8 1 1248629 3 2004-04-22
9 1 1842128 4 2004-05-09
10 1 2238063 3 2005-05-11
11 1 1503895 4 2005-05-19
13 2 1288844 3 2005-09-06
14 2 832109 5 2005-05-13
You can parse that structure quite easily (without regex, using a few lines of very readable vanilla Python) and build a dictionary while reading the data file. You can then convert the dictionary to a DataFrame in one go.
import pandas as pd

df = {'MovieID': [], 'CustomerID': [], 'Rating': [], 'Date': []}
with open('data.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if line:  # skip empty lines
            if line.endswith(':'):  # MovieID line
                movie_id = line[:-1]
            else:
                customer_id, rating, date = line.split(',')
                df['MovieID'].append(movie_id)
                df['CustomerID'].append(customer_id)
                df['Rating'].append(rating)
                df['Date'].append(date)

df = pd.DataFrame(df)
print(df)
MovieID CustomerID Rating Date
0 1 1488844 3 2005-09-06
1 1 822109 5 2005-05-13
2 1 885013 4 2005-10-19
3 1 30878 4 2005-12-26
4 2 823519 3 2004-05-03
5 2 893988 3 2005-11-17
6 2 124105 4 2004-08-05
7 2 1248629 3 2004-04-22
8 2 1842128 4 2004-05-09
9 3 2238063 3 2005-05-11
10 3 1503895 4 2005-05-19
11 3 1288844 3 2005-09-06
12 3 832109 5 2005-05-13
It hardly gets easier than this.
An error in a regular expression
You get the NameError because of the /gm in the regular expression you use to compute result.
I suppose that /gm was copied here by mistake. In other languages these are the GLOBAL and MULTILINE match modifiers, which, by the way, are not needed in this case. But in the Python re module they are just three literal characters. Since no line of your file contains /gm, result was always None, so the elif result: ... block was never executed and the variables customerid, rating and date were never initialized.
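A quick way to see this (a minimal check using one sample line from the test data below):
import re

line = "1488844,3,2005-09-06"
print(re.search(r"^(\d+),(\d+),(\d{4}-\d{2}-\d{2})/gm", line))  # None: '/gm' is matched literally
print(re.search(r"^(\d+),(\d+),(\d{4}-\d{2}-\d{2})", line).groups())  # ('1488844', '3', '2005-09-06')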
An error in working with variables
If you remove /gm from the first pattern, you'll face another problem: the variables customerid, rating, date and movie_id are just strings, so the resulting data frame will reflect only the last record of the source file.
To avoid this, we have to collect the values in a list-like structure. For example, in the code below they are keys of the data dictionary, each referring to a separate list:
file_name = ...
data = {'movie_id': [], 'customerid': [], 'rating': [], 'date': []}

with open(file_name, encoding='latin-1') as f:
    for line in f:
        result = re.search(r"^(\d+),(\d+),(\d{4}-\d{2}-\d{2})", line)
        if re.search(r"(^\d+):", line) is not None:
            movie_id = re.search(r"(^\d+):", line).group(1)
        elif result:
            data['movie_id'].append(movie_id)
            data['customerid'].append(result.group(1))
            data['rating'].append(result.group(2))
            data['date'].append(result.group(3))
        else:
            continue

df = pd.DataFrame(data)
Code with test data
import re
import pandas as pd

data = '''\
1:
1488844,3,2005-09-06
822109,5,2005-05-13
885013,4,2005-10-19
30878,4,2005-12-26
2:
823519,3,2004-05-03
893988,3,2005-11-17
124105,4,2004-08-05
1248629,3,2004-04-22
1842128,4,2004-05-09
3:
2238063,3,2005-05-11
1503895,4,2005-05-19
1288844,3,2005-09-06
832109,5,2005-05-13
'''
file_name = "data.txt"
with open(file_name, 'tw', encoding='latin-1') as f:
    f.write(data)

data = {'movie_id': [], 'customerid': [], 'rating': [], 'date': []}
with open(file_name, encoding='latin-1') as f:
    for line in f:
        result = re.search(r"^(\d+),(\d+),(\d{4}-\d{2}-\d{2})", line)
        if re.search(r"(^\d+):", line) is not None:
            movie_id = re.search(r"(^\d+):", line).group(1)
        elif result:
            data['movie_id'].append(movie_id)
            data['customerid'].append(result.group(1))
            data['rating'].append(result.group(2))
            data['date'].append(result.group(3))
        else:
            continue

df = pd.DataFrame(data)
df.to_csv(file_name[:-3] + 'csv', index=False)
An alternative
df = pd.read_csv(file_name, names=['customerid', 'rating', 'date'])
df.insert(0, 'movie_id', pd.NA)
# movie id lines like "1:" end up in the customerid column;
# mask everything else, fill the ids downwards and strip the trailing colon
isnot_movie_id = ~df['customerid'].str.endswith(':')
df['movie_id'] = df['customerid'].mask(isnot_movie_id).ffill().str[:-1]
# the movie id rows themselves have NaN rating/date, so dropna removes them
df = df.dropna().reset_index(drop=True)

How to read a file containing more than one record type within?

I have a .csv file that contains 3 types of records, each with a different number of columns.
I know the structure of each record type and that the rows always come type1 first, then type2, and type3 at the end, but I don't know how many rows of each record type there are.
The first 4 characters of each row define the record type of that row.
CSV Example:
typ1,John,Smith,40,M,Single
typ1,Harry,Potter,22,M,Married
typ1,Eva,Adams,35,F,Single
typ2,2020,08,16,A
typ2,2020,09,02,A
typ3,Chevrolet,FC101TT,2017
typ3,Toyota,CE972SY,2004
How can I read it with pandas? It doesn't matter if I have to read one record type at a time.
Thanks!
Here is a pandas solution.
First we must read the csv file in a way that keeps each entire line in a single cell. We do that by using a "wrong" separator, such as '#'. It can be any character we want, as long as we can guarantee it never appears in the data file.
import pandas as pd

wrong_sep = '#'
right_sep = ','
df = pd.read_csv('my_file.csv', sep=wrong_sep, header=None).iloc[:, 0]  # header=None so the first line is kept as data
The .iloc[:, 0] is used as a quick way to convert a DataFrame into a Series.
Then we use a loop to select the rows that belong to each data structure based on their starting characters. Now we use the "right separator" (probably a comma ',') to split the desired data into real DataFrames.
starters = ['typ1', 'typ2', 'typ3']
detected_dfs = dict()
for start in starters:
    _df = df[df.str.startswith(start)].str.split(right_sep, expand=True)
    detected_dfs[start] = _df
And here you go. If we print the resulting DataFrames, we get:
0 1 2 3 4 5
0 typ1 John Smith 40 M Single
1 typ1 Harry Potter 22 M Married
2 typ1 Eva Adams 35 F Single
0 1 2 3 4
3 typ2 2020 08 16 A
4 typ2 2020 09 02 A
0 1 2 3
5 typ3 Chevrolet FC101TT 2017
6 typ3 Toyota CE972SY 2004
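If you don't need the record-type label as a data column afterwards (a small follow-up of mine, not part of the original answer), drop the first column:
typ1 = detected_dfs['typ1'].drop(columns=[0])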
Let me know if it helped you!
Not Pandas:
from collections import defaultdict

filename2 = 'Types.txt'
with open(filename2) as dataLines:
    nL = dataLines.read().splitlines()

defDList = defaultdict(list)
subs = ['typ1', 'typ2', 'typ3']
for line in nL:
    for sub in subs:
        if line.startswith(sub):
            defDList[sub].append(line)

print(defDList)
Output:
defaultdict(<class 'list'>, {'typ1': ['typ1,John,Smith,40,M,Single', 'typ1,Harry,Potter,22,M,Married', 'typ1,Eva,Adams,35,F,Single'], 'typ2': ['typ2,2020,08,16,A', 'typ2,2020,09,02,A'], 'typ3': ['typ3,Chevrolet,FC101TT,2017', 'typ3,Toyota,CE972SY,2004']})
You can make use of the skiprows parameter of pandas read_csv method to skip the rows you are not interested in for a particular record type. The following gives you a dictionary dfs of dataframes for each type. An advantage is that records of the same types don't necessarily have to be adjacent to each other in the csv file.
For larger files you might want to adjust the code so that the file is only read once instead of twice; see the sketch after the code below.
import pandas as pd
from collections import defaultdict

indices = defaultdict(list)
types = ['typ1', 'typ2', 'typ3']
filename = 'test.csv'
with open(filename) as csv:
    for idx, line in enumerate(csv.readlines()):
        for typ in types:
            if line.startswith(typ):
                indices[typ].append(idx)

dfs = {typ: pd.read_csv(filename, header=None,
                        skiprows=lambda x: x not in indices[typ])
       for typ in types}
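For the read-once variant, one option (my sketch, not part of the answer above) is to bucket the raw lines by their first four characters in a single pass and let read_csv parse each bucket from an in-memory buffer:
import io
import pandas as pd
from collections import defaultdict

lines_by_type = defaultdict(list)
with open('test.csv') as f:
    for line in f:
        lines_by_type[line[:4]].append(line)  # first 4 chars identify the record type

dfs = {typ: pd.read_csv(io.StringIO(''.join(lines)), header=None)
       for typ, lines in lines_by_type.items()}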
Read the file as a CSV file using the CSV reader. The reader fortunately does not care about line formats:
import csv

with open("yourfile.csv") as infile:
    data = list(csv.reader(infile))
Collect the rows with the same first element and build a dataframe of them:
import pandas as pd
from itertools import groupby
dfs = [pd.DataFrame(v) for _,v in groupby(data, lambda x: x[0])]
You've got a list of three dataframes (or as many as necessary).
dfs[1]
# 0 1 2 3 4
#0 typ2 2020 08 16 A
#1 typ2 2020 09 02 A
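If you'd rather key the result by record type instead of list position, a small variation (my sketch) on the same idea:
from itertools import groupby
import pandas as pd

dfs_by_type = {key: pd.DataFrame(list(rows)) for key, rows in groupby(data, lambda x: x[0])}
dfs_by_type['typ2']  # same frame as dfs[1] above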

Pandas module export, split data

I'm trying to read a .txt file and output the count of each letter, which works; however, I'm having trouble exporting that data to .csv in a specific way.
A snippet of the code:
freqs = {}
with open(Book1) as f:
    for line in f:
        for char in line:
            if char in freqs:
                freqs[char] += 1
            else:
                freqs[char] = 1
print(freqs)
And for the exporting to csv, I did the following:
test = {'Book 1 Output':[freqs]}
df = pd.DataFrame(test, columns=['Book 1 Output'])
df.to_csv(r'book_export.csv', sep=',')
Currently when I run it, the export looks like this (screenshot in the original post):
However, I want the output to look something like this when I open it:
I want it split on the ":" and "," into 3 different columns.
I've tried various other answers on here, but most of them end up giving ValueErrors, so maybe I just don't know how to apply them, like the following one:
df[[',']] = df[','].str.split(expand=True)
Use DataFrame.from_dict with DataFrame.rename_axis to set the index name; the csv then looks like you need:
#sample data
freqs = {'a':5,'b':2}
df = (pd.DataFrame.from_dict(freqs, orient='index',columns=['Book 1 Output'])
.rename_axis('Letter'))
print (df)
Book 1 Output
Letter
a 5
b 2
df.to_csv(r'book_export.csv', sep=',')
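The resulting book_export.csv then contains (for the sample freqs above):
Letter,Book 1 Output
a,5
b,2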
Or, alternatively, use a Series:
s = pd.Series(freqs, name='Book 1 Output').rename_axis('Letter')
print (s)
Letter
a 5
b 2
Name: Book 1 Output, dtype: int64
s.to_csv(r'book_export.csv', sep=',')
EDIT:
If there are multiple frequency dicts, change to the DataFrame constructor:
freqs = {'a':5,'b':2}
freqs1 = {'a':9,'b':3}
df = pd.DataFrame({'f1':freqs, 'f2':freqs1}).rename_axis('Letter')
print (df)
f1 f2
Letter
a 5 9
b 2 3

Name a column added with pandas dataframe

I have the following csv file, which I process as follows:
import pandas as pd
df = pd.read_csv('file.csv', sep=',', header=None)
id ocr raw_value
00037625-4706-4dfe-a7b3-de8c47e3a28d A 3
000a7b30-4c4f-4756-a757-f688ccc55d5d A /c
000b08e3-4129-4fd2-8ec0-23d00fe38a45 A yes
00196436-12bc-4024-b623-25bac586d314 A know
001b8c43-3e73-43c1-ba4f-df5edb10dfac A hi
002882ca-48bb-4161-a75a-cf0ec984d650 A fd
003b2890-3727-4c79-955a-f74ec6945ed7 A Sensible
004d9025-86f0-4f8c-9720-01e3385c5e77 A 2015
Now I want to add a new column:
df['val'] = None
for img in images:
    id, ext = img.rsplit('.', 1)
    idx = df[df[0] == id].index.values
    df.loc[df.index[idx], 'val'] = id
When I write df to a new file as follows:
df.to_csv('new_file.csv', sep=',', encoding='utf-8')
I notice that the column is correctly added and filled, but it remains without a name, even though it's supposed to be named val:
id ocr raw_value
00037625-4706-4dfe-a7b3-de8c47e3a28d A 3 4
000a7b30-4c4f-4756-a757-f688ccc55d5d A /c 3
000b08e3-4129-4fd2-8ec0-23d00fe38a45 A yes 1
00196436-12bc-4024-b623-25bac586d314 A know 8
001b8c43-3e73-43c1-ba4f-df5edb10dfac A hi 9
002882ca-48bb-4161-a75a-cf0ec984d650 A fd 10
003b2890-3727-4c79-955a-f74ec6945ed7 A Sensible 14
How do I set the name of the last column added?
EDIT1:
print(df.head())
0 1 2 3
0 id ocr raw_value manual_raw_value
1 00037625-4706-4dfe-a7b3-de8c47e3a28d ABBYY 03 03
2 000a7b30-4c4f-4756-a757-f688ccc55d5d ABBYY y/c y/c
3 000b08e3-4129-4fd2-8ec0-23d00fe38a45 ABBYY armoire armoire
4 00196436-12bc-4024-b623-25bac586d314 ABBYY point point
val
0 None
1 93
2 yic
3 armoire
4 point
You only need read_csv with its defaults, because sep=',' is the default and can be omitted, and header=None should only be used if the csv has no header:
df = pd.read_csv('file.csv')
The problem is that with header=None your first row was not parsed as column names but as the first data row.
df = pd.read_csv('file.csv', sep=',', header=0, index_col=0)
should allow you to simplify the next portion to
df['val'] = None
for img in images:
    image_id, ext = img.rsplit('.', 1)
    df.loc[image_id, 'val'] = image_id
If you don't need the image_id as index afterwards, use df.reset_index(inplace=True)
One easy way...
before to_csv:
df.columns.values[3] = "val"
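A more robust alternative (my suggestion, not part of the answer above) is rename, which returns a new frame instead of mutating the index's underlying array:
df = df.rename(columns={df.columns[3]: 'val'})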

How to parse CSV file and search by item in first column

I have a CSV file with over 4,000 lines formatted like...
name, price, cost, quantity
How do I trim my CSV file so only the 20 names I want remain? I am able to parse/trim the CSV file, but I am coming up blank on how to search column 1.
Use pandas!
import pandas as pd
# toy data for demonstration; in your case you would load the file instead:
# df = pd.read_csv('input_csv.csv')
df = pd.DataFrame({'name': ['abc', 'ght', 'kjh'], 'price': [7, 5, 6], 'cost': [9, 0, 2], 'quantity': [1, 3, 4]})
>>> df
cost name price quantity
0 9 abc 7 1
1 0 ght 5 3
2 2 kjh 6 4
>>> names_wanted = ['abc','kjh']
>>> df_trim = df[df['name'].isin(names_wanted)]
>>> df_trim
cost name price quantity
0 9 abc 7 1
2 2 kjh 6 4
Then export the file to csv:
>>> df_trim.to_csv('trimmed_csv.csv', index=False)
Done!
You can loop through csv.reader(). It returns rows as lists; compare the first element of each list, i.e. row[0], and if it matches a name you want, add the row to an output list.
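A minimal sketch of that approach (file names and target names are placeholders):
import csv

names_wanted = {'abc', 'kjh'}  # hypothetical names to keep

output_rows = []
with open('input_csv.csv', newline='') as infile:
    for row in csv.reader(infile):
        if row and row[0].strip() in names_wanted:
            output_rows.append(row)

with open('trimmed_csv.csv', 'w', newline='') as outfile:
    csv.writer(outfile).writerows(output_rows)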
You could create a plain text file with each of the 20 names on a separate line (perhaps called target_names). Then, with your CSV file (perhaps called file.csv), on the command line (bash):
for name in $(cat target_names); do grep $name file.csv >> my_new_small_file.csv; done
If you have issues with case sensitivity, use grep -i.
Not sure I understood you right, but can the snippet below do what you want?
def FilterCsv(_sFilename, _aAllowedNameList):
    l_aNewFileLines = []
    l_inputFile = open(_sFilename, 'r')
    for l_sLine in l_inputFile:
        l_aItems = l_sLine.split(',')
        if l_aItems[0] in _aAllowedNameList:
            l_aNewFileLines.append(l_sLine)
    l_inputFile.close()
    l_outputFile = open('output_' + _sFilename, 'w')
    for l_sLine in l_aNewFileLines:
        l_outputFile.write(l_sLine)
    l_outputFile.close()
Hope this can be of any help!
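Hypothetical usage, with placeholder file and names:
FilterCsv('file.csv', ['abc', 'kjh'])  # kept rows are written to 'output_file.csv'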
