read csv with pandas with this kind of dataset - python

I'm having some trouble reading a dataset like this:
# title
# description
# link (could be not still active)
# id
# date
# source (nyt|us|reuters)
# category
example:
court agrees to expedite n.f.l.'s appeal\n
the decision means a ruling could be made nearly two months before the regular season begins, time for the sides to work out a deal without delaying the
season.\n
http://feeds1.nytimes.com/~r/nyt/rss/sports/~3/nbjo7ygxwpc/04nfl.html\n
0\n
04 May 2011 07:39:03\n
nyt\n
sport\n
I tried:
columns = ['title', 'description', 'link', 'id', 'date', 'source', 'category']
df = pd.read_csv('news', delimiter="\n", names=columns, error_bad_lines=False)
But it puts all the information into the title column.
Does someone know a way to deal with this?
Thanks!

You can't use \n as a delimiter for read_csv. What you could do instead is read each line as a single value, set the index equal to the column names, and then transpose (this assumes the file holds a single seven-line record), i.e.
df = pd.read_csv('news', header=None, sep='\t')  # no tabs in the data, so each line stays whole
df.index = columns
df = df.transpose()

Here are a few things to note:
1) Any delimiter more than 1 character long is interpreted by pandas as a regular expression.
2) Since the 'c' engine does not support regex, I have explicitly set engine='python' to avoid warnings.
3) I had to add a dummy column because there is a '\n' at the end of the file; I later removed that column using drop.
So, these lines will hopefully produce what you want.
columns = ['title', 'description', 'link', 'id', 'date', 'source', 'category','dummy']
df = pd.read_csv('news', names=columns, delimiter="\\\\n", engine='python').drop('dummy', axis=1)
df
I hope this helps :)
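If the file actually holds many records rather than a single one, a plain-Python alternative is to split the raw text yourself and rebuild the frame. This is only a sketch under assumptions: it takes the \n shown in the example to be a literal two-character backslash-n separator, and it expects every record to contain exactly seven fields in the order given in the question.
import pandas as pd

columns = ['title', 'description', 'link', 'id', 'date', 'source', 'category']

with open('news') as f:
    # split on the literal "\n" sequences and drop empty leftovers
    pieces = [p.strip() for p in f.read().split('\\n')]
    fields = [p for p in pieces if p]

# regroup the flat list of fields into seven-field records
records = [fields[i:i + len(columns)] for i in range(0, len(fields), len(columns))]
df = pd.DataFrame(records, columns=columns)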

Related

Python - Pandas Search and update multiple values - is this possible or should I use Import CSV?

I have a CSV with 10k lines. I want to first search for the rows with the required info and then edit those lines.
Example below:
jobs = [
    'X01_TEST1_C',
    'P01_TEST3_B'
]
headers = ['job', 'name', 'date', 'extrainfo']
data = [
    ['X01_TEST1_C', 'NAME', 'DATE', 'EXTRADATA'],
    ['P01_TEST3_C', 'NAME', 'DATE', 'EXTRADATA'],
    ['X01_TEST1002_C', 'NAME', 'DATE', 'EXTRADATA'],
    ['X01_TEST4231_C', 'NAME', 'DATE', 'EXTRAP01_TEST3_BDATA']
]
I can load this into pandas and then search for single items using the code below.
df = pd.read_csv("filename", sep=",", encoding='cp1252')
df1 = df[(df['job'].str.contains("X01_TEST1_C", na=False))]
print(df1)
which would print
['X01_TEST1_C', 'NAME', 'DATE', 'EXTRADATA']
How can I search for multiple values at once via pandas?
I want something like:
df1 = df[(df['job'].str.contains(jobs, na=False))]
But I get the error TypeError: first argument must be string or compiled pattern.
Once I get past this part I want to update some jobs from X01_TEST1_C to X01_NEW_TEST1_C - adding this bit of info in case it's easier to do the whole thing at once.
Is pandas good for this, or do I need to try a different method like import csv?
Thanks for any help.
try:
jobs = [
    'X01_TEST1_C',
    'P01_TEST3_B'
]
df1 = df[df['job'].str.contains('|'.join(jobs), na=False)]  # the default is regex=True, so no need to add it
# this is similar to:
df1 = df[df['job'].str.contains('X01_TEST1_C|P01_TEST3_B', na=False)]
The simplest might be to concatenate your jobs into a single regexp and use that:
import re

jobs_re = "|".join(re.escape(job) for job in jobs)
df1 = df[df['job'].str.contains(jobs_re, regex=True, na=False)]
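For the second, updating part of the question (renaming jobs such as X01_TEST1_C to X01_NEW_TEST1_C), pandas handles 10k rows easily. Here is a minimal sketch, assuming a simple old-name-to-new-name mapping; the renames dict and the output file name are illustrative, not from the original question:
renames = {'X01_TEST1_C': 'X01_NEW_TEST1_C'}  # illustrative mapping, adjust to your jobs
df['job'] = df['job'].replace(renames)        # exact-value replacement, other rows are left alone
df.to_csv("filename", index=False)            # write the edited data back out if needed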

PySpark DataFrame showing different results when using .select()

Why is .select() showing/parsing values differently to when I don't use it?
I have this CSV:
CompanyName, CompanyNumber,RegAddress.CareOf,RegAddress.POBox,RegAddress.AddressLine1, RegAddress.AddressLine2,RegAddress.PostTown,RegAddress.County,RegAddress.Country,RegAddress.PostCode,CompanyCategory,CompanyStatus,CountryOfOrigin,DissolutionDate,IncorporationDate,Accounts.AccountRefDay,Accounts.AccountRefMonth,Accounts.NextDueDate,Accounts.LastMadeUpDate,Accounts.AccountCategory,Returns.NextDueDate,Returns.LastMadeUpDate,Mortgages.NumMortCharges,Mortgages.NumMortOutstanding,Mortgages.NumMortPartSatisfied,Mortgages.NumMortSatisfied,SICCode.SicText_1,SICCode.SicText_2,SICCode.SicText_3,SICCode.SicText_4,LimitedPartnerships.NumGenPartners,LimitedPartnerships.NumLimPartners,URI,PreviousName_1.CONDATE, PreviousName_1.CompanyName, PreviousName_2.CONDATE, PreviousName_2.CompanyName,PreviousName_3.CONDATE, PreviousName_3.CompanyName,PreviousName_4.CONDATE, PreviousName_4.CompanyName,PreviousName_5.CONDATE, PreviousName_5.CompanyName,PreviousName_6.CONDATE, PreviousName_6.CompanyName,PreviousName_7.CONDATE, PreviousName_7.CompanyName,PreviousName_8.CONDATE, PreviousName_8.CompanyName,PreviousName_9.CONDATE, PreviousName_9.CompanyName,PreviousName_10.CONDATE, PreviousName_10.CompanyName,ConfStmtNextDueDate, ConfStmtLastMadeUpDate
"ATS CAR RENTALS LIMITED","10795807","","",", 1ST FLOOR ,WESTHILL HOUSE 2B DEVONSHIRE ROAD","ACCOUNTING FREEDOM","BEXLEYHEATH","","ENGLAND","DA6 8DS","Private Limited Company","Active","United Kingdom","","31/05/2017","31","5","28/02/2023","31/05/2021","TOTAL EXEMPTION FULL","28/06/2018","","0","0","0","0","49390 - Other passenger land transport","","","","0","0","http://business.data.gov.uk/id/company/10795807","","","","","","","","","","","","","","","","","","","","","12/06/2023","29/05/2022"
"ATS CARE LIMITED","10393661","","","UNIT 5 CO-OP BUILDINGS HIGH STREET","ABERSYCHAN","PONTYPOOL","TORFAEN","WALES","NP4 7AE","Private Limited Company","Active","United Kingdom","","26/09/2016","30","9","30/06/2023","30/09/2021","UNAUDITED ABRIDGED","24/10/2017","","0","0","0","0","87900 - Other residential care activities n.e.c.","","","","0","0","http://business.data.gov.uk/id/company/10393661","17/05/2018","ATS SUPPORT LIMITED","22/12/2017","ATS CARE LIMITED","","","","","","","","","","","","","","","","","09/10/2022","25/09/2021"
I'm reading the csv like so:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
_file = "/path/dir/BasicCompanyDataAsOneFile-2022-08-01.csv"
df = spark.read.csv(_file, header=True, quote='"', escape="\"")
Focusing on the CompanyCategory column, we should see Private Limited Company for both lines. But this is what I get instead when using select():
df.select("CompanyCategory").show(truncate=False)
+-----------------------+
|CompanyCategory |
+-----------------------+
|DA6 8DS |
|Private Limited Company|
+-----------------------+
df.select("CompanyCategory").collect()
[Row(CompanyCategory='DA6 8DS'),
Row(CompanyCategory='Private Limited Company')]
vs when not using select():
from pprint import pprint
for row in df.collect():
    pprint(row.asDict())
{' CompanyNumber': '10795807',
...
'CompanyCategory': 'Private Limited Company',
'CompanyName': 'ATS CAR RENTALS LIMITED',
...}
{' CompanyNumber': '10393661',
...
'CompanyCategory': 'Private Limited Company',
'CompanyName': 'ATS CARE LIMITED',
...}
Using asDict() for readability.
SQL doing the same thing:
df.createOrReplaceTempView("companies")
spark.sql('select CompanyCategory from companies').show()
+--------------------+
| CompanyCategory|
+--------------------+
|Private Limited C...|
| DA6 8DS|
+--------------------+
As you can see, when not using select() the CompanyCategory values show correctly. Why is this happening? What can I do to avoid this?
Context: I'm trying to create dimension tables, which is why I'm selecting a single column. The next phase is to drop duplicates, filter, sort, etc.
Edit:
Here are two example values in the actual CSV that are throwing things off:
CompanyName of """ BORA "" 2 LTD"
1st line address of ", 1ST FLOOR ,WESTHILL HOUSE 2B DEVONSHIRE ROAD"
Note:
These values are from two separate lines in the CSV.
These values were copied and pasted from the CSV opened in a text editor (like Notepad or VSCode).
Tried and failed:
df = spark.read.csv(_file, header=True) - picks up the wrong columns entirely.
df = spark.read.csv(_file, header=True, escape='\"') - exactly the same behaviour described in the original question above, so same results.
df = spark.read.csv(_file, header=True, escape='""') - since the CSV escapes quotes using two double quotes, I guessed that using two double quotes as the escape param would do the trick, but I get the following error:
Py4JJavaError: An error occurred while calling o276.csv.
: java.lang.RuntimeException: escape cannot be more than one character
When reading the csv, the parameters quote and escape are set to the same value ('"'=="\"" returns True in Python).
I would guess that configuring both parameters in this way somehow disturbs the parser that Spark uses to separate the individual fields. After removing the escape parameter, you can process the remaining " characters with regexp_replace:
from pyspark.sql import functions as F
df = spark.read.csv(<filename>, header=True, quote='"')
cols = [F.regexp_replace(F.regexp_replace(
            F.regexp_replace("`" + col + "`", '^"', ''),
            '"$', ''), '""', '"').alias(col) for col in df.columns]
df.select(cols).show(truncate=False)
Probably there is a smart regexp that can combine all three replace operations into one...
This is an issue when reading a single column from a CSV file vs. when reading all the columns:
df = spark.read.csv('companies-house.csv', header=True, quote='"', escape="\"")
df.show() # correct output (all columns loaded)
df.toPandas() # same as pd.read_csv()
df.select('CompanyCategory').show() # wrong output (trying to load a single column)
df.cache() # all columns will be loaded and used in any subsequent call
df.select('CompanyCategory').show() # correct output
The first select() performs a different (optimized) read than the second, so one possible workaround would be to cache() the data immediately. This will however load all the columns, not just one (although pandas and COPY do the same).
The problematic part of the CSV is the RegAddress.POBox column, where an empty value is saved as ", instead of "",. You can check this by incrementally loading more columns:
df.unpersist() # undo cache() operation (for testing purposes only)
df.select(*[f"`{c}`" for c in df.columns[:3]], 'CompanyCategory').show() # wrong
df.select(*[f"`{c}`" for c in df.columns[:4]], 'CompanyCategory').show() # OK

Pandas method chaining with str.contains and str.split

I'm learning Pandas method chaining and am having trouble using str.contains and str.split in a chain. The data is one week's worth of information scraped from a Wikipedia page; I will be scraping several years' worth of weekly data.
This code without chaining works:
# list of data scraped from the web:
list = ['Unnamed: 0','Preseason-Aug 11','Week 1-Aug 26','Week 2-Sep 2',
    'Week 3-Sep 9','Week 4-Sep 23','Week 5-Sep 30','eek 6-Oct 7','Week 7-Oct 14',
    'Week 8-Oct 21','Week 9-Oct 28','Week 10-Nov 4','Week 11-Nov 11','Week 12-Nov 18',
    'Week 13-Nov 25','Week 14Dec 2','Week 15-Dec 9','Week 16 (Final)-Jan 4','Unnamed: 18']
# load to dataframe:
df = pd.DataFrame(list)
# rename column 0 to text:
df = df.rename(columns={0: "text"})
# remove rows that contain "Unnamed":
df = df[~df['text'].str.contains("Unnamed")]
# split column 0 into 'week' and 'released' at the hyphen:
df[['week', 'released']] = df["text"].str.split(pat='-', expand=True)
Here's my attempt to rewrite it as a chain:
#load to dataframe:
df = pd.DataFrame(list)
# function to remove rows that contain "Unnamed"
def filter_unnamed(df):
    df = df[~df["text"].str.contains("Unnamed")]
    return df

clean_df = (df
    .rename(columns={0: "text"})
    .pipe(filter_unnamed)
    # [['week','released']] = lambda df_: df_["text"].str.split('-', expand=True)
)
The first line of the clean_df chain to rename column 0 works.
The second line removes rows that contain "Unnamed"; it works, but is there a better way than using pipe and a function?
I'm having the most trouble with str.split in the 3rd row (doesn't work, commented out). I tried assign for this and think it should work, but I don't know how to pass in the new column names ("week" and "released") with the str.split function.
Thanks for the help.
I also couldn't figure out how to create two columns in one go from the split... but I was able to do it by splitting twice and accessing parts 1 and 2 in succession (not ideal), df.assign(week = ...[0], released = ...[1]).
Note also I reset the index.
df.assign(
    week=df[0].str.split(pat='-', expand=True)[0],
    released=df[0].str.split(pat='-', expand=True)[1]
)[~df[0].str.contains("Unnamed")].reset_index(drop=True).rename(columns={0: "text"})
I'm sure there's a sleeker way, but this may help.
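One possible sleeker variant is to split only once and join the resulting columns back into the chain. This is just a sketch, assuming the scraped list is named data instead of list (to avoid shadowing the built-in):
clean_df = (pd.DataFrame(data, columns=['text'])
    .loc[lambda d: ~d['text'].str.contains('Unnamed')]
    .reset_index(drop=True)
    .pipe(lambda d: d.join(d['text'].str.split('-', n=1, expand=True)
                            .rename(columns={0: 'week', 1: 'released'}))))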

removing newlines from messy strings in pandas dataframe cells?

I've used multiple ways of splitting and stripping the strings in my pandas dataframe to remove all the '\n' characters, but for some reason it simply doesn't want to delete the characters that are attached to other words, even though I split them. I have a pandas dataframe with a column that captures text from web pages using BeautifulSoup. The text has been cleaned a bit already by BeautifulSoup, but it failed to remove the newlines attached to other characters. My strings look a bit like this:
"hands-on\ndevelopment of games. We will study a variety of software technologies\nrelevant to games including programming languages, scripting\nlanguages, operating systems, file systems, networks, simulation\nengines, and multi-media design systems. We will also study some of\nthe underlying scientific concepts from computer science and related\nfields including"
Is there an easy python way to remove these "\n" characters?
EDIT: the correct answer to this is:
df = df.replace(r'\n',' ', regex=True)
I think you need replace:
df = df.replace('\n','', regex=True)
Or:
df = df.replace('\n',' ', regex=True)
Or:
df = df.replace(r'\\n',' ', regex=True)
Sample:
text = '''hands-on\ndev nologies\nrelevant scripting\nlang
'''
df = pd.DataFrame({'A':[text]})
print (df)
A
0 hands-on\ndev nologies\nrelevant scripting\nla...
df = df.replace('\n',' ', regex=True)
print (df)
A
0 hands-on dev nologies relevant scripting lang
df.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["",""], regex=True, inplace=True)
worked for me.
Source:
https://gist.github.com/smram/d6ded3c9028272360eb65bcab564a18a
To remove carriage returns (\r), new lines (\n) and tabs (\t):
df = df.replace(r'\r+|\n+|\t+', '', regex=True)
In messy data it might be a good idea to remove all whitespace: df.replace(r'\s', '', regex=True, inplace=True).
df = pd.DataFrame({'text': ['Sarah Marie Wimberly So so beautiful!!!\\nAbram Staten You guys look good man.\\nTJ Sloan I miss you guys\\n']})
df = df.replace(r'\\n', ' ', regex=True)  # the strings contain literal backslash-n sequences, hence the doubled backslash
This worked for the messy data I had.
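If only the scraped text column needs cleaning, a column-level variant keeps the rest of the frame untouched; a small sketch, assuming the column is called 'text':
df['text'] = df['text'].str.replace(r'\s*\n\s*', ' ', regex=True)  # collapse newlines and surrounding spaces into one space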

Pandas KeyError: value not in index

I have the following code,
df = pd.read_csv(CsvFileName)
p = df.pivot_table(index=['Hour'], columns='DOW', values='Changes', aggfunc=np.mean).round(0)
p.fillna(0, inplace=True)
p[["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]] = p[["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]].astype(int)
It has always worked until the csv file doesn't have enough coverage (of all weekdays). For example, with the following .csv file:
DOW,Hour,Changes
4Wed,01,237
3Tue,07,2533
1Sun,01,240
3Tue,12,4407
1Sun,09,2204
1Sun,01,240
1Sun,01,241
1Sun,01,241
3Tue,11,662
4Wed,01,4
2Mon,18,4737
1Sun,15,240
2Mon,02,4
6Fri,01,1
1Sun,01,240
2Mon,19,2300
2Mon,19,2532
I'll get the following error:
KeyError: "['5Thu' '7Sat'] not in index"
It seems to have a very easy fix, but I'm just too new to Python to know how to fix it.
Use reindex to get all columns you need. It'll preserve the ones that are already there and put in empty columns otherwise.
p = p.reindex(columns=['1Sun', '2Mon', '3Tue', '4Wed', '5Thu', '6Fri', '7Sat'])
So, your entire code example should look like this:
df = pd.read_csv(CsvFileName)
p = df.pivot_table(index=['Hour'], columns='DOW', values='Changes', aggfunc=np.mean).round(0)
p.fillna(0, inplace=True)
columns = ["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]
p = p.reindex(columns=columns)
p[columns] = p[columns].astype(int)
I had a very similar issue. I got the same error because the csv contained spaces in the header. My csv contained a header "Gender " and I had it listed as:
[['Gender']]
If it's easy enough for you to access your csv, you can use the Excel formula TRIM() to clip any spaces off the cells,
or remove them like this:
df.columns = df.columns.to_series().apply(lambda x: x.strip())
please try this to clean and format your column names:
df.columns = (df.columns.str.strip().str.upper()
              .str.replace(' ', '_')
              .str.replace('(', '')
              .str.replace(')', ''))
I had the same issue.
During the first development I used a .csv file (comma as separator) that I had modified a bit before saving it.
After saving, the commas became semicolons.
On Windows it is dependent on the "Regional and Language Options" customize screen where you find a List separator. This is the char Windows applications expect to be the CSV separator.
When testing from a brand new file I encountered that issue.
I removed the 'sep' argument from the read_csv call.
before:
df1 = pd.read_csv('myfile.csv', sep=',');
after:
df1 = pd.read_csv('myfile.csv');
That way, the issue disappeared.
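As a related option, pandas can also detect the separator itself when you pass sep=None together with the python engine (it uses csv.Sniffer under the hood); a small sketch, assuming the same myfile.csv:
df1 = pd.read_csv('myfile.csv', sep=None, engine='python')  # sniffs ',' vs ';' automatically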
