Problem with line terminator \n on dataframe and .csv - Python

I'm getting (with a Python API) a .csv file from an email attachment I received in Gmail, transforming it into a dataframe to do some data prep, and saving it as a .csv on my PC. It is working great; the problem is that I get '\n' in some columns (it came like that from the source attachment).
The code that I use to get the data and transform it into a dataframe and a .csv:
import io
import pandas as pd

# 'part' is the email attachment retrieved from Gmail
r = io.BytesIO(part.get_payload(decode=True))
df = pd.DataFrame(r)
df.to_csv('C:/Users/x.csv', index=False)
Example of the df that I get:
+-------------+----------+---------+----------------------+
| Information | Modified | Created | MD_x0020_Agenda\r\n' |
+-------------+----------+---------+----------------------+
| c | d | f | \r\n' |
| b\n' | | | |
| c | e | \r\n' | |
+-------------+----------+---------+----------------------+
Example of the correct output:
+-------------+----------+---------+----------------------+
| Information | Modified | Created | MD_x0020_Agenda\r\n' |
+-------------+----------+---------+----------------------+
| c | d | f | \r\n' |
| b | c | e | \r\n' |
+-------------+----------+---------+----------------------+
I tried to use line_terminator. In my mind, if I forced it to use only \r\n and not \n, it would work. It didn't.
df.to_csv('C:/Users/x.csv', index=False, line_terminator='\r\n')
Can somebody give me a hand with that? It's really freaking me out; because of it I can't advance on my project. Thanks.

Usually, this "\n" marks that the sentence continues on the next line, i.e. the 'return' key, a line break.
You can get rid of it just by applying replace('\n', '') on your dataframe, with regex=True so it also matches newlines embedded inside cell values (a plain replace only matches cells that are exactly '\n'):
df = df.replace('\n', '', regex=True)
For more details on the function, consider checking the pandas documentation for DataFrame.replace.
Hope it works.
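As a quick check, here is a minimal sketch on a toy frame that reproduces the symptom (column names taken from the question's example):
import pandas as pd

# Toy frame with a stray '\n' embedded in one cell, as in the question.
df = pd.DataFrame({'Information': ['b\n', 'c'], 'Modified': ['c', 'e']})

# regex=True makes replace() strip the newline inside 'b\n' as well,
# instead of only matching cells that are exactly '\n'.
df = df.replace('\n', '', regex=True)
print(df)  # 'b\n' becomes 'b'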

I mixed the two answers and got the solution, thanks!
PS: with some research I found that this is a Windows/Excel issue: when you export a .csv, Excel treats \n and \r\n (\r too?) as a new row, while a DataFrame by default treats only \r\n as a new row.
import io
import pandas as pd

df = pd.read_csv(io.BytesIO(part.get_payload(decode=True)), header=None)
# grab the first row for the header
new_header = df.iloc[0]
# take the data less the header row
df = df[1:]
# set the header row as the df header
df.columns = new_header
# replace the \n which is creating new lines
df['Information'] = df['Information'].replace(regex='\n', value='')
df.to_csv('C:/Users/x.csv', index=False)

Related

Python count hashtag per platform

My data is organized in a data frame with the following structure:
| ID | Post | Platform |
| -------- | ------------------- | ----------- |
| 1 | Something #hashtag1 | Twitter |
| 2 | Something #hashtag2 | Insta |
| 3 | Something #hashtag1 | Twitter |
I have been able to extract and count the hashtags using the following (based on another post):
df.Post.str.extractall(r'(\#\w+)')[0].value_counts().rename_axis('hashtags').reset_index(name='count')
I am now trying to count hashtag occurrences per platform. I am trying the following:
df.groupby(['Post', 'Platform'])['Post'].str.extractall(r'(\#\w+)')[0].value_counts().rename_axis('hashtags').reset_index(name='count')
But, I am getting the following error:
AttributeError: 'SeriesGroupBy' object has no attribute 'str'
We can solve this easily in two steps, assuming each post has just a single hashtag.
Step 1: Create a new column with the hashtag
df['hashtag'] = df.Post.str.extractall(r'(\#\w+)')[0].reset_index()[0]
Step 2: Group by and get the counts
df.groupby('Platform').hashtag.count()
Generic solution: works for any number of hashtags
Again in a few steps:
# extract all hashtags
df1 = df.Post.str.extractall(r'(\#\w+)')[0].reset_index()
# set the index to the index of the original table the hashtag came from
df1.set_index('level_0', inplace=True)
df1.rename(columns={0: 'hashtag'}, inplace=True)
df2 = pd.merge(df, df1, right_index=True, left_index=True)
df2.groupby('Platform').hashtag.count()
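For reference, a self-contained sketch of the generic approach on the question's sample data (column names taken from the post; droplevel is one way to collapse extractall's match level):
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Post': ['Something #hashtag1', 'Something #hashtag2', 'Something #hashtag1'],
    'Platform': ['Twitter', 'Insta', 'Twitter'],
})

# extractall indexes matches by (original row, match number); dropping the
# 'match' level lets the hashtags join back onto their platforms.
hashtags = df.Post.str.extractall(r'(\#\w+)')[0].droplevel('match').rename('hashtag')
counts = df.join(hashtags).groupby(['Platform', 'hashtag']).size()
print(counts)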

Pyspark: How to escape backslash (\) in input file

I am loading a csv file into postgresql using pyspark. I have a record in the input file which looks like below -
Id,dept,city,name,country,state
1234,ABC,dallas,markhenry\,USA,texas
When I load it into the postgresql database, it gets loaded like this, which is not correct -
Id | dept| city | name | country | state
1234 | ABC | dallas | markhenry,USA | texas | null
correct output in postgresdb should be -
Id | dept| city | name | country | state
1234 | ABC | dallas | markhenry | USA | texas
I am reading the file like below -
input_df = spark.read.format("csv").option("quote", "\"").option("escape", "\"") \
    .option("header", "true").load(filepath)
Is there a way I can modify my code to handle the backslash (\) coming in the data? Thanks in advance.
The purpose of the "quote" option is to specify a quote character, which wraps entire column values. Not sure if that is needed here, but you can use the regexp_replace function to remove specific characters (just select everything else as-is and modify the name column this way).
from pyspark.sql.functions import *
df = spark.read.option("inferSchema", "true").option("header", "true").csv(filepath)
df2 = df.select(col("Id"), col("dept"), col("city"), regexp_replace(col("name"), "\\\\", "").alias("name"), col("country"), col("state"))
df2.show(4, False)
Output:
+----+----+------+---------+-------+-----+
|Id |dept|city |name |country|state|
+----+----+------+---------+-------+-----+
|1234|ABC |dallas|markhenry|USA |texas|
+----+----+------+---------+-------+-----+
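If more columns could contain stray backslashes, the same cleanup can be applied to every string column instead of listing them by hand. A sketch, assuming the backslashes should simply be stripped wherever they occur:
from pyspark.sql.functions import col, regexp_replace

# Apply regexp_replace to each string column; pass other columns through as-is.
df2 = df.select([
    regexp_replace(col(c), r'\\', '').alias(c) if t == 'string' else col(c)
    for c, t in df.dtypes
])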

Pandas not displaying all columns when writing to CSV

I am attempting to export a dataset that looks like this:
+----------------+--------------+--------------+--------------+
| Province_State | Admin2 | 03/28/2020 | 03/29/2020 |
+----------------+--------------+--------------+--------------+
| South Dakota | Aurora | 1 | 2 |
| South Dakota | Beedle | 1 | 3 |
+----------------+--------------+--------------+--------------+
However, the actual CSV file I am getting is like so:
+-----------------+--------------+--------------+
| Province_State | 03/28/2020 | 03/29/2020 |
+-----------------+--------------+--------------+
| South Dakota | 1 | 2 |
| South Dakota | 1 | 3 |
+-----------------+--------------+--------------+
Using the code below (runnable by calling createCSV(); it pulls data from the COVID government GitHub):
import csv  # csv reader
import pandas as pd  # csv parser
import collections  # not needed
import requests  # retrieves URL from gov data

def getFile():
    url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv'
    response = requests.get(url)
    print('Writing file...')
    open('us_deaths.csv', 'wb').write(response.content)

# takes raw data from the link, creates a CSV for each unique state and removes unneeded headings
def createCSV():
    getFile()
    # init data
    data = pd.read_csv('us_deaths.csv', delimiter=',')
    # drop extra columns
    data.drop(['UID'], axis=1, inplace=True)
    data.drop(['iso2'], axis=1, inplace=True)
    data.drop(['iso3'], axis=1, inplace=True)
    data.drop(['code3'], axis=1, inplace=True)
    data.drop(['FIPS'], axis=1, inplace=True)
    # data.drop(['Admin2'], axis=1, inplace=True)
    data.drop(['Country_Region'], axis=1, inplace=True)
    data.drop(['Lat'], axis=1, inplace=True)
    data.drop(['Long_'], axis=1, inplace=True)
    data.drop(['Combined_Key'], axis=1, inplace=True)
    # data.drop(['Province_State'], axis=1, inplace=True)
    data.to_csv('DEBUGDATA2.csv')
    # sets Province_State as the primary key; searches based on date and key to create new CSVs in the root directory of the app
    data = data.set_index('Province_State')
    data = data.iloc[:, 2:].rename(columns=pd.to_datetime, errors='ignore')
    for name, g in data.groupby(level='Province_State'):
        g[pd.date_range('03/23/2020', '03/29/20')] \
            .to_csv('{0}_confirmed_deaths.csv'.format(name))
The reason for the loop is to set the date columns (everything after the first two) to dates, so that I can select only 03/23/2020 and beyond. If anyone has a better method of doing this, I would love to know.
To ensure it works, it prints out all the field names, including Admin2 (county name), Province_State, and the rest of the dates.
However, in my CSV as you can see, Admin2 seems to have disappeared. I am not sure how to make this work; if anyone has any ideas, that'd be great!
Changed
data = data.set_index('Province_State')
to
data = data.set_index(['Province_State', 'Admin2'])
Needed to create a multi-key index to allow the Admin2 column to show. Any smoother tips on the date-range section are welcome.
Thanks for the help all!
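For the date-range part, a minimal sketch of one alternative (assuming, as in the question, that every remaining data column is named by a date string): parse each column name and keep only the ones inside the window, with no renaming loop needed.
import pandas as pd

data = pd.read_csv('us_deaths.csv')
data = data.set_index(['Province_State', 'Admin2'])

# Column names that parse as dates become timestamps; everything else becomes NaT,
# and NaT comparisons are False, so non-date columns drop out of the mask.
parsed = pd.to_datetime(data.columns, errors='coerce')
date_cols = data.columns[(parsed >= '2020-03-23') & (parsed <= '2020-03-29')]

for name, g in data.groupby(level='Province_State'):
    g[date_cols].to_csv('{0}_confirmed_deaths.csv'.format(name))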

string manipulation, data wrangling, regex

I have a .txt file of 3 million rows. The file contains data that looks like this:
# RSYNC: 0 1 1 0 512 0
#$SOA 5m localhost. hostmaster.localhost. 1906022338 1h 10m 5d 1s
# random_number_ofspaces_before_this text $TTL 60s
#more random information
:127.0.1.2:https://www.spamhaus.org/query/domain/$
test
:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com
.0-1-hub.com
.zzzy1129.cn
:127.0.1.4:https://www.spamhaus.org/query/domain/$
.0-il.ml
.005verf-desj.com
.01accesfunds.com
In the above data, each code line applies to all of the domains listed beneath it.
I want to turn the above data into a format that can be loaded into HiveQL/SQL. The HiveQL table should look like:
+--------------------+--------------+-------------+-----------------------------------------------------+
| domain_name | period_count | parsed_code | raw_code |
+--------------------+--------------+-------------+-----------------------------------------------------+
| test | 0 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-0m5tk.com | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-1-hub.com | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .zzzy1129.cn | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-il.ml | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .005verf-desj.com | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .01accesfunds.com | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
+--------------------+--------------+-------------+-----------------------------------------------------+
Please note that I do not want the vertical bars in any output; they are just to make the above look like a table.
I'm guessing that creating a HiveQL table like the above will involve converting the .txt into a .csv or a Pandas data frame. If creating a .csv, then the .csv would probably look like:
domain_name,period_count,parsed_code,raw_code
test,0,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-1-hub.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.zzzy1129.cn,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-il.ml,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.005verf-desj.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.01accesfunds.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
I'd be interested in a Python solution, but lack familiarity with the packages and functions necessary to complete the above data wrangling steps. I'm looking for a complete solution, or code tidbits to construct my own solution. I'm guessing regular expressions will be needed to identify the "category" or "code" line in the raw data. They always start with ":127.0.1." I'd also like to parse the code out to create a parsed_code column, and a period_count column that counts the number of periods in the domain_name string. For testing purposes, please create a .txt of the sample data I have provided at the beginning of this post.
Regardless of how you want to format it in the end, I suppose the first step is to separate the domain_name and code. That part is pure Python:
rows = []
code = None
parsed_code = None
with open('input.txt', 'r') as f:
    for line in f:
        line = line.rstrip('\n')
        if line.startswith(':127'):
            code = line
            parsed_code = line.split(':')[1]
            continue
        if line.startswith('#'):
            continue
        period_count = line.count('.')
        rows.append((line, period_count, parsed_code, code))
Just for illustration, you can use pandas to format the data nicely as tables, which might help if you want to pipe this to SQL, but it's not absolutely necessary. Post-processing of strings is also quite straightforward in pandas.
import pandas as pd
df = pd.DataFrame(rows, columns=['domain_name', 'period_count', 'parsed_code', 'raw_code'])
print(df)
prints this:
domain_name period_count parsed_code raw_code
0 test 0 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
1 .0-0m5tk.com 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
2 .0-1-hub.com 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
3 .zzzy1129.cn 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
4 .0-il.ml 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
5 .005verf-desj.com 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
6 .01accesfunds.com 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
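From here, writing the CSV layout shown in the question is a one-liner (index=False drops pandas' row numbers):
df.to_csv('output.csv', index=False)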
You can do all of this with the Python standard library.
HEADER = "domain_name | code"
# Open files
with open("input.txt") as f_in, open("output.txt", "w") as f_out:
# Write header
print(HEADER, file=f_out)
print("-" * len(HEADER), file=f_out)
# Parse file and output in correct format
code = None
for line in f_in:
if line.startswith("#"):
# Ignore comments
continue
if line.endswith("$"):
# Store line as the current "code"
code = line
else:
# Write these domain_name entries into the
# output file separated by ' | '
print(line, code, sep=" | ", file=f_out)
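If the four-column CSV from the question is wanted instead, the same loop can emit all the fields. A sketch using the csv module, assuming (as in the sample) that a code line always precedes its domains:
import csv

with open("input.txt") as f_in, open("output.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    writer.writerow(["domain_name", "period_count", "parsed_code", "raw_code"])
    code = None
    for line in f_in:
        line = line.rstrip("\n")
        if line.startswith("#"):
            continue  # skip comment lines
        if line.startswith(":127"):
            code = line  # remember the current code line
            continue
        # e.g. ":127.0.1.2:https://..." splits to ['', '127.0.1.2', ...]
        writer.writerow([line, line.count("."), code.split(":")[1], code])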

Eliminate perceived index value from data frame concatenation

I'm trying to concatenate two data frames and write said data frame to an Excel file. The concatenation is performed somewhat successfully, but I'm having a difficult time eliminating the index row that also gets appended.
I would appreciate it if someone could highlight what I'm doing wrong. I thought providing the index=False argument at every Excel call would eliminate the issue, but it has not.
(screenshot of the exported spreadsheet, showing the unwanted extra row, omitted)
# filenames
file_name = "C:\\Users\\ga395e\\Desktop\\TEST_FILE.xlsx"
file_name2 = "C:\\Users\\ga395e\\Desktop\\TEST_FILE_2.xlsx"
#create data frames
df = pd.read_excel(file_name, index=False)
df2 = pd.read_excel(file_name2, index=False)
#filter frame
df3 = df2[['WDDT', 'Part Name', 'Remove SN']]
#concatenate values
df4 = df3['WDDT'].map(str) + '-' +df3['Part Name'].map(str) + '-' + 'SN:'+ df3['Remove SN'].map(str)
test=pd.DataFrame(df4)
test=test.transpose()
df = pd.concat([df, test], axis=1)
df.to_excel("C:\\Users\\ga395e\\Desktop\\c.xlsx", index=False)
Thanks
As the other users also wrote, I don't see the index in your image. If the index were included, you would get an output like the following:
| Index | Column1 | Column2 |
|-------+----------+----------|
| 0 | Entry1_1 | Entry1_2 |
| 1 | Entry2_1 | Entry2_2 |
| 2 | Entry3_1 | Entry3_2 |
if you pass the index=False option the index will be removed:
| Column1 | Column2 |
|----------+----------|
| Entry1_1 | Entry1_2 |
| Entry2_1 | Entry2_2 |
| Entry3_1 | Entry3_2 |
which looks like your case. Your problem could be related to the concatenation and the transposed matrix.
Did you check your temporary dataframe before exporting it?
You might want to check whether pandas imports the time column as a time index.
If you want to delete those time columns, you could use df.drop and pass an array of columns into it, e.g. df.drop(df.columns[:3], axis=1). Does this maybe solve your problem?
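If the stray row comes from the transposed frame's labels, a sketch like the following may help (it continues from the question's df and df3; resetting both frames' labels before concatenating keeps pandas from writing them out as an extra row):
# Build the combined value column as in the question.
df4 = df3['WDDT'].map(str) + '-' + df3['Part Name'].map(str) + '-' + 'SN:' + df3['Remove SN'].map(str)

# Reset the labels on both sides so concat does not introduce spurious ones.
test = pd.DataFrame(df4).transpose().reset_index(drop=True)
test.columns = range(test.shape[1])
df = pd.concat([df.reset_index(drop=True), test], axis=1)
df.to_excel("C:\\Users\\ga395e\\Desktop\\c.xlsx", index=False)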
