Pandas module export, split data - Python

I'm trying to read a .txt file and output the count of each letter, which works; however, I'm having trouble exporting that data to .csv in a specific way.
A snippet of the code:
freqs = {}
with open(Book1) as f:
    for line in f:
        for char in line:
            if char in freqs:
                freqs[char] += 1
            else:
                freqs[char] = 1
print(freqs)
And for the exporting to csv, I did the following:
test = {'Book 1 Output':[freqs]}
df = pd.DataFrame(test, columns=['Book 1 Output'])
df.to_csv(r'book_export.csv', sep=',')
Currently when I run it, the export looks like this (Manually done):
However I want the output to be each individual row, so it should look something like this when I open it:
I want it to be split at the ":" and "," into 3 different columns.
I've tried various other answers on here, but most of them end up giving ValueErrors, so maybe I just don't know how to apply them, like the following one:
df[[',']] = df[','].str.split(expand=True)

Use DataFrame.from_dict with DataFrame.rename_axis to set the index name; then the csv looks like you need:
#sample data
freqs = {'a':5,'b':2}
df = (pd.DataFrame.from_dict(freqs, orient='index', columns=['Book 1 Output'])
        .rename_axis('Letter'))
print(df)
        Book 1 Output
Letter
a                   5
b                   2
df.to_csv(r'book_export.csv', sep=',')
Or, alternatively, use a Series:
s = pd.Series(freqs, name='Book 1 Output').rename_axis('Letter')
print(s)
Letter
a    5
b    2
Name: Book 1 Output, dtype: int64
s.to_csv(r'book_export.csv', sep=',')
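One caveat (this depends on your pandas version): older releases did not write the Series name as a header by default, so the Book 1 Output heading could be missing from the file. Passing header=True makes it explicit:
s.to_csv(r'book_export.csv', sep=',', header=True)  # first line becomes Letter,Book 1 Output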
EDIT:
If there are multiple frequency dicts, change the DataFrame constructor:
freqs = {'a':5,'b':2}
freqs1 = {'a':9,'b':3}
df = pd.DataFrame({'f1':freqs, 'f2':freqs1}).rename_axis('Letter')
print(df)
        f1  f2
Letter
a        5   9
b        2   3
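Writing this out works the same as before; a short usage sketch (the file starts with a Letter,f1,f2 header row, then one row per letter):
df.to_csv(r'book_export.csv', sep=',')  # rows come out as a,5,9 and b,2,3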

Related

Python & Pandas: appending data to new column

With Python and Pandas, I'm writing a script that passes text data from a csv through the pylanguagetool library to calculate the number of grammatical errors in a text. The script successfully runs, but appends the data to the end of the csv instead of to a new column.
The structure of the csv is:
The working code is:
import pandas as pd
from pylanguagetool import api

df = pd.read_csv("Streamlit\stack.csv")
text_data = df["text"].fillna('')
length1 = len(text_data)
for i, x in enumerate(range(length1)):
    # this is the pylanguagetool operation
    errors = api.check(text_data, api_url='https://languagetool.org/api/v2/', lang='en-US')
    result = str(errors)
    # this pulls the error count "message" from the pylanguagetool json
    error_count = result.count("message")
    output_df = pd.DataFrame({"error_count": [error_count]})
    output_df.to_csv("Streamlit\stack.csv", mode="a", header=(i == 0), index=False)
The output is:
Expected output:
What changes are necessary to append the output like this?
Instead of using a loop, you might consider apply with a lambda, which accomplishes what you want in one line:
df["error_count"] = df["text"].fillna("").apply(lambda x: len(api.check(x, api_url='https://languagetool.org/api/v2/', lang='en-US')["matches"]))
>>> df
   user_id  ...  error_count
0       10  ...            2
1       11  ...            0
2       12  ...            0
3       13  ...            0
4       14  ...            0
5       15  ...            2
Edit:
You can write the above to a .csv file with:
df.to_csv("Streamlit\stack.csv", index=False)
You don't want to use mode="a" as that opens the file in append mode whereas you want (the default) write mode.
My strategy would be to keep the error counts in a list, then create a separate column in the original dataframe, and finally write that dataframe to csv:
text_data = df["text"].fillna('')
error_count_lst = []
for text in text_data:
    # check each text individually rather than passing the whole column at once
    errors = api.check(text, api_url='https://languagetool.org/api/v2/', lang='en-US')
    result = str(errors)
    error_count = result.count("message")
    error_count_lst.append(error_count)
df['error_count'] = error_count_lst  # add the counts as a new column on the original dataframe
df.to_csv('file.csv', index=False)

How to convert text data into a DataFrame

How can I convert the text data below into a pandas DataFrame:
(-9.83334315,-5.92063135,-7.83228037,5.55314146), (-5.53137301,-8.31010785,-3.28062536,-6.86067081),
(-11.49239039,-1.68053601,-4.14773043,-3.54143976), (-22.25802006,-10.12843806,-2.9688831,-2.70574665), (-20.3418791,-9.4157625,-3.348587,-7.65474665)
I want to convert this to a DataFrame with 4 rows and 5 columns. For example, the first row contains the first element of each parenthesized group.
Thanks for your contribution.
Try this:
import pandas as pd

with open("file.txt") as f:
    file = f.read()
df = pd.DataFrame([{f"name{id}": val.replace("(", "").replace(")", "")
                    for id, val in enumerate(row.split(",")) if val}
                   for row in file.split()])
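Note this keeps the values as strings and the original 5-rows-by-4-columns orientation; if you want the 4-rows-by-5-columns layout asked about, a small follow-up sketch (assuming the df built above):
df = df.astype(float).T  # cast the string values to float and transpose to 4 rows x 5 columns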
import re
import pandas as pd

with open('file.txt') as f:
    data = [re.findall(r'([\-\d.]+)', line) for line in f.readlines()]
df = pd.DataFrame(data).T.astype(float)
Output:
          0         1          2          3          4
0 -9.833343 -5.531373 -11.492390 -22.258020 -20.341879
1 -5.920631 -8.310108  -1.680536 -10.128438  -9.415762
2 -7.832280 -3.280625  -4.147730  -2.968883  -3.348587
3  5.553141 -6.860671  -3.541440  -2.705747  -7.654747
Your data is basically in tuple-of-tuples form, hence you can easily pass a list of tuples instead of a tuple of tuples and get a DataFrame out of it.
Your Sample Data:
text_data = ((-9.83334315,-5.92063135,-7.83228037,5.55314146),(-5.53137301,-8.31010785,-3.28062536,-6.86067081),(-11.49239039,-1.68053601,-4.14773043,-3.54143976),(-22.25802006,-10.12843806,-2.9688831,-2.70574665),(-20.3418791,-9.4157625,-3.348587,-7.65474665))
Result:
As you can see, the default display shows up to 6 decimal places while your values have more, hence you can use pd.options.display.float_format and set it accordingly.
pd.options.display.float_format = '{:,.8f}'.format
To get the data in your desired shape, simply transpose as well.
pd.DataFrame(list(text_data)).T
            0           1            2            3            4
0 -9.83334315 -5.53137301 -11.49239039 -22.25802006 -20.34187910
1 -5.92063135 -8.31010785  -1.68053601 -10.12843806  -9.41576250
2 -7.83228037 -3.28062536  -4.14773043  -2.96888310  -3.34858700
3  5.55314146 -6.86067081  -3.54143976  -2.70574665  -7.65474665
OR
Simply, you can use the below as well, where you create a DataFrame directly from the tuples (or a list of tuples).
data = (-9.83334315,-5.92063135,-7.83228037,5.55314146),(-5.53137301,-8.31010785,-3.28062536,-6.86067081),(-11.49239039,-1.68053601,-4.14773043,-3.54143976),(-22.25802006,-10.12843806,-2.9688831,-2.70574665),(-20.3418791,-9.4157625,-3.348587,-7.65474665)
# data = [(-9.83334315,-5.92063135,-7.83228037,5.55314146),(-5.53137301,-8.31010785,-3.28062536,-6.86067081),(-11.49239039,-1.68053601,-4.14773043,-3.54143976),(-22.25802006,-10.12843806,-2.9688831,-2.70574665),(-20.3418791,-9.4157625,-3.348587,-7.65474665)]
pd.DataFrame(data).T
            0           1            2            3            4
0 -9.83334315 -5.53137301 -11.49239039 -22.25802006 -20.34187910
1 -5.92063135 -8.31010785  -1.68053601 -10.12843806  -9.41576250
2 -7.83228037 -3.28062536  -4.14773043  -2.96888310  -3.34858700
3  5.55314146 -6.86067081  -3.54143976  -2.70574665  -7.65474665
Wrap the tuples in a list:
data=[(-9.83334315,-5.92063135,-7.83228037,5.55314146),
(-5.53137301,-8.31010785,-3.28062536,-6.86067081),
(-11.49239039,-1.68053601,-4.14773043,-3.54143976),
(-22.25802006,-10.12843806,-2.9688831,-2.70574665),
(-20.3418791,-9.4157625,-3.348587,-7.65474665)]
df = pd.DataFrame(data, columns=['A','B','C','D'])
print(df)
Output:
           A          B         C         D
0  -9.833343  -5.920631 -7.832280  5.553141
1  -5.531373  -8.310108 -3.280625 -6.860671
2 -11.492390  -1.680536 -4.147730 -3.541440
3 -22.258020 -10.128438 -2.968883 -2.705747
4 -20.341879  -9.415762 -3.348587 -7.654747
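This gives 5 rows and 4 columns; if you specifically need the 4-rows-by-5-columns orientation from the question, transposing is enough (a sketch using the df above):
print(df.T)  # columns become the five original tuples, rows become A-D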

How to extract values from a list in Python and put into a dataframe

I have trained a model and have asked the model to produce the coefficients:
modelcoeffs = model.fit(X_train, y_train).coef_
coeffslist = list(modelcoeffs)
which yields me, for example:
print(coeffslist):
[0.17005542 0.72965947 0.6833308 0.02509676]
I am trying to split these 4 coefficients out; however, they don't seem to be individual elements.
Does anyone know how to split these into four numbers?
I am trying to get:
df['1'] = coeffslist[0]
df['2'] = coeffslist[1]
df['3'] = coeffslist[2]
df['4'] = coeffslist[3]
But it gives me NaN in the df. Does anyone have any ideas? thanks!
UPDATE
I am basically trying to get the coeffs to append to a df
print(df)
1           2           3          4
....        .....       .....      .....
0.17005542  0.72965947  0.6833308  0.02509676
This coeffslist doesn't look like a valid Python structure; it's missing commas (that's how a NumPy array prints).
But you might try this:
import pandas as pd
df = pd.DataFrame([0.17005542, 0.72965947, 0.6833308, 0.02509676])
print(df)
Output:
          0
0  0.170055
1  0.729659
2  0.683331
3  0.025097
To get the coefs as a row, try this:
import pandas as pd
df = pd.DataFrame(columns=list("1234"))
df.loc[len(df)] = [0.17005542, 0.72965947, 0.6833308, 0.02509676]
print(df)
Output:
          1         2         3         4
0  0.170055  0.729659  0.683331  0.025097
And if you want to add another row (append) of coefs, just do this:
df.loc[1] = [0.17005542, 0.72965947, 0.6833308, 0.02509676]
print(df)
Output:
          1         2         3         4
0  0.170055  0.729659  0.683331  0.025097
1  0.170055  0.729659  0.683331  0.025097
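Alternatively, since coeffslist already holds the four numbers, you could build the one-row frame in a single call (a sketch using the names from the question):
df = pd.DataFrame([coeffslist], columns=['1', '2', '3', '4'])  # one row, four named columns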
You can convert [0.17005542 0.72965947 0.6833308 0.02509676] to a string, split it on spaces, convert back to float, and then append it to a dataframe.
str_list= str(coeffslist[0])
float_list= [float(x) for x in str_list.split()]
df=pd.DataFrame(columns=['1','2','3','4'])
a_series = pd.Series(float_list, index = df.columns)
df = df.append(a_series, ignore_index=True)
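Note that DataFrame.append has since been removed (pandas 2.0+); a minimal sketch of the same step using pd.concat, assuming the float_list and df built above:
row = pd.DataFrame([float_list], columns=df.columns)
df = pd.concat([df, row], ignore_index=True)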

Python Remove duplicates from csv if value in column duplicated

I am trying to write a csv parser so that if I have the same name in the name column, I will delete the second occurrence's line. For example:
['CSE_MAIN\\LC-CSEWS61', 'DEREGISTERED', '2018-04-18-192446'],
['CSE_MAIN\\IT-Laptop12', 'DEREGISTERED', '2018-03-28-144236'],
['CSE_MAIN\\LC-CSEWS61', 'DEREGISTERED', '2018-03-28-144236']]
I need that the last line will be deleted because it has the same name as the first one.
What I wrote is:
file2 = str(sys.argv[2])
print("The first file is:" + file2)
reader2 = csv.reader(open(file2))

with open("result2.csv", 'wb') as result2:
    wtr2 = csv.writer(result2)
    for r in reader2:
        wtr2.writerow((r[0], r[6], r[9]))

newreader2 = csv.reader(open("result2.csv"))
sortedlist2 = sorted(newreader2, key=lambda col: col[2], reverse=True)
for i in range(len(sortedlist2)):
    for j in range(len(sortedlist2)-1):
        if (sortedlist2[i][0] == sortedlist2[j+1][0] and sortedlist2[i][1] != sortedlist2[j+1][1]):
            if (sortedlist2[i][1] > sortedlist2[j+1][1]):
                del sortedlist2[i][0-2]
            else:
                del sortedlist2[j+1][0-2]
Thanks.
Try with pandas:
import pandas as pd
df = pd.read_csv('path/name_file.csv', header=None)  # header=None so the columns are labeled 0, 1, 2 as in the output below
df = df.drop_duplicates([0])  # 0 is the column to compare on
df.to_csv('New_file.csv')  # save to csv
This method deletes all duplicates based on the first column.
If you just need to delete a specific row, you can use the drop method.
# Your file after using pandas (print(df)):
                      0             1                  2
0   CSE_MAIN\LC-CSEWS61  DEREGISTERED  2018-04-18-192446
1  CSE_MAIN\IT-Laptop12  DEREGISTERED  2018-03-28-144236
2   CSE_MAIN\LC-CSEWS61  DEREGISTERED  2018-03-28-144236
For example, if you need to delete row 2:
df.drop(2, axis=0, inplace=True)  # axis=0 means rows; axis=1 would mean columns
Output:
                      0             1                  2
0   CSE_MAIN\LC-CSEWS61  DEREGISTERED  2018-04-18-192446
1  CSE_MAIN\IT-Laptop12  DEREGISTERED  2018-03-28-144236
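If you would rather stay with the csv module, as in your original script, a hedged sketch that keeps only the first row seen for each name (the input filename here is hypothetical):
import csv

seen = set()
with open('input.csv', newline='') as src, open('result2.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        if row[0] in seen:  # this name was already written, skip the duplicate line
            continue
        seen.add(row[0])
        writer.writerow(row)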

Merge text file with a csv database file using pandas

[Update my question]
I have a text file that looks like the below:
#File_infoomation1
#File_information2
A B C D
1 2 3 4.2
5 6 7 8.5   # example.txt is separated by tab '\t'; column A dtype is object
I'd like to merge the text file with a csv database file based on column E. That column contains integers.
E,name,age
1,john,23
5,mary,24 # database.csv column E type is int64
So I tried to read the text file and then remove the first 2 unneeded header lines.
example = pd.read_csv('example.txt', header = 2, sep = '\t')
database = pd.read_csv('database.csv')
request = example.rename(columns={'A': 'E'})
New_data = request.merge(database, on='E', how='left')
But the result is not what I want; it shows NaN in the name and age columns.
I think the int64 vs. object dtype mismatch is where the mistake is. Does anyone know how to work this out?
E,B,C,D,name,age
1,2,3,4.2,NaN,NaN
5,6,7,8.5,NaN,NaN
You just need to edit this in your code:
instead of
example = pd.read_csv('example.txt', header = 2, sep = '\t', delim_whitespace=False )
Use this:
example = pd.read_csv('example.txt', sep = ' ' ,index_col= False)
Actually I tried reading your files with:
example = pd.read_csv('example.txt', header = 2, sep = '\t')
# Renaming
example.columns = ['E','B','C','D']
database = pd.read_csv('database.csv')
New_data = example.merge(database, on='E', how='left')
And this returns:
   E  B  C    D  name  age
0  1  2  3  4.2  john   23
1  5  6  7  8.5  mary   24
EDIT: actually, the separator of the original example.txt file is not clear. If it is a space, try putting sep='\s' instead of sep=' '.
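If the int64 vs. object mismatch the asker suspects really is the cause, casting the key column before the merge is another option; a sketch, assuming column E in example.txt holds clean integer values:
example = pd.read_csv('example.txt', header=2, sep='\t')
example = example.rename(columns={'A': 'E'})
example['E'] = example['E'].astype('int64')  # match the int64 key in database.csv
database = pd.read_csv('database.csv')
New_data = example.merge(database, on='E', how='left')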
